Optimal approximate matrix product in terms of stable rank

2015·arXiv

Abstract

1 Introduction

Much recent work has successfully utilized randomized dimensionality reduction techniques to speed up solutions to linear algebra problems, with applications in machine learning, statistics, optimization, and several other domains; see the recent monographs [HMT11, Mah11, Woo14] for more details. In our work here, we give new spectral norm guarantees for approximate matrix multiplication (AMM). Aside from AMM being interesting in its own right, it has become a useful primitive in the literature for analyzing algorithms for other large-scale linear algebra problems as well. We show applications of our new guarantees to speeding up standard algorithms for generalized regression and low-rank approximation problems. We also describe applications of our results to k-means clustering (discovered in [CEM15]) and nonparametric regression [YPW15].

In AMM we are given A, B each with a large number of rows n, and the goal is to compute some matrix is “small”, for some norm . Furthermore, we would like to compute C much faster than the usual time required to exactly compute

Work on randomized methods for AMM began with [DKM06], which focused on i.e., Frobenius norm error. They showed by picking an appropriate sampling matrix Π

with good probability, if , we mean the rows of Π are independent, and each row is all zero except for a 1 in a random location according to some appropriate distribution. If ) can be computed in O(mdp) time once Πare formed, as opposed to the straightforward O(ndp) time to compute

The Frobenius norm error guarantee of Eq. (1) was also later achieved in [Sar06, Lemma 6] via a different approach, with some later optimizations to the parameters in [KN14, Theorem 6.2]. The approach of Sarl´os was not via sampling, but rather to use a matrix Π drawn from a distribution satisfying an “oblivious Johnson-Lindenstrauss (JL)” guarantee, i.e. a distribution satisfying the following condition for some

Such a matrix Π can be taken with )) [JL84]. Furthermore, one can take Π to be a Fast JL transform [AC09] (or any of the follow-up improvements [AL13, KW11, NPW14, Bou14, HR16]) or a sparse JL transform [DKS10, KN14] to speed up the computation of ΠA and ΠB. One could also use the Thorup-Zhang sketch [TZ12] combined with a certain technique of [LBKW14] (see [Woo14, Theorem 2.10] for details) to efficiently boost success probability.

Other than Frobenius norm error, the main other error guarantee investigated in previous work is spectral error. That is, we would like to be small, where denotes the largest singular value of M. If one is interested in applying to some set of input vectors then this type of error is the most meaningful, since being small is equivalent to for any x. The first work along these lines was again by [DKM06], who gave a procedure based on entry-wise sampling of the entries of A and B. The works [DMM06, SS11] showed that row-sampling according to leverage scores also provides the desired guarantee with few samples.

Then [Sar06], combined with a quantitative improvement in [CW13], showed that one can take a Π drawn from an oblivious JL distribution with ) denotes rank and r = r(A)+r(B). Then for Π with ), with probability at least 1

As we shall see shortly via a very simple lemma (Lemma 1), a sufficient deterministic condition implying Eq. (3) is that Π is an -dimensional subspace spanned by the columns of A, B. The notion of a subspace embedding was introduced by [Sar06].

-subspace embedding for satisfies Eq. (3) with . This is equivalent to (1 + preserves norms of all vectors in the subspace spanned by the columns U.

)-oblivious subspace embedding (OSE)

Fast subspace embeddings Π, i.e. such that the products Πcan be computed quickly, are known using variants on the Fast JL transform such as the Subsampled Randomized Hadamard Transform (SRHT) [Sar06, LWM07, Tro11, LDFU13] (also see a slightly improved analysis of the SRHT in Section A.2) or via sparse subspace embeddings [CW13, MM13, NN13, LMP13, CLMCoh16]. In most applications it is important to have a fast subspace embedding to shrink the time it takes to transform the input data to a lower-dimensional form. The SRHT is a construction of a Π such that ΠA can be computed in time O(nd log n) (see Section A.2 for details of the construction). The sparse subspace embedding constructions have some parameter m rows and exactly s non-zero entries per column, so that ΠA can be computed in time ) is the number of non-zero entries, and there is a tradeoff in the upper bounds between m and s.

An issue addressed by the work of [MZ11] is that of robustness. As stated above, achieving Eq. (3) requires Π be a subspace embedding for an r-dimensional subspace. However, consider the case when A (and similarly for B) is of high rank but can be expressed as the sum of a low-rank matrix plus high-rank noise of small magnitude, i.e., where is very small but has high (even full) rank. One would hope the noise could be ignored, but standard results require Π to have a number of rows at least as large as r, regardless of how small the magnitude of the noise is. Another case of interest (as we will see in Section 3) is when A and B are each of high rank, but their singular values decay at some appropriate rate. As discussed in Section 3, in several applications where AMM is not the final goal but rather is used as a primitive in analyzing an algorithm for some other problem (such as k-means clustering or nonparametric regression), the matrices that arise do indeed have such decaying singular values.

The work [MZ11] remedied this by considering the ˜) always, but can be much less if A has a small tail of singular values. Let ˜). Among other results, [MZ11] showed that to achieve Eq. (3) with good probability, one can take Π to be a random (scaled) sign matrix with either ) rows. As noted in follow-up work [KVZ14], both the 1dependence and the log(d + p) factor are undesirable. In their data-driven low dimensional embedding application, they wanted a dimension m independent of the original dimensions, which are assumed much larger than the stable rank, and also wanted lower dependence on 1. To this end, [KVZ14] defined the and showed ) rows suffice for ˜Here is the nuclear norm, i.e., sum of singular values of is the sum of squared singular values, it is straightforward to see that ˜) always. Thus there is a tradeoff: the stable rank guarantee is worsened to nuclear rank, but dependence on 1is improved to quadratic.

We show switching to the weaker ˜nr guarantee is unnecessary by showing quadratic dependence on 1holds even with stable rank. This answers the main open question of [MZ11, KVZ14].

To state our results in a more natural way, we rephrase our main result to say that we achieve

for an arbitrary 1, and we do so by using subspace embeddings for O(k)-dimensional subspaces in a certain black box way (which will be made precise soon) regardless of the ranks of A, B.

Remark 1. Note that our previously stated main contribution is equivalent, since one could set ) to arrive at the conclusion that subspace embeddings for )-dimensional subspaces yield the guarantee in Eq. (3). Alternatively one could obtain the Eq. (4) guarantee via Eq. (3) with error parameter

Henceforth, we use the following definition.

Definition 2. For conforming matrices satisfies the ()-approximate spectral norm matrix multiplication property ((is random and satisfies (-AMM with probability 1for any fixed A, B, then we say Π satisfies (

Our main contribution: We give two different characterizations for Π supporting (both of which imply ()-AMM Π having The first characterization applies to any OSE distribution for which a moment bound has been proven for (which is true for the best analyses of all known OSE’s). In this case, we show a black box theorem: any ()-OSE provides ()-AMM. Since matrices with subgaussian entries and )-OSE’s, our originally stated main result follows. This result is optimal, since [NN14] shows any randomized distribution over Π with m rows having the ()-AMM property must have ) (the hard instance there is when A = B = U has orthonormal columns, and thus rank and stable rank are equal).

Our second characterization identifies certain deterministic conditions which, if satisfied by Π, imply the desired ()-AMM property. These conditions are of the form: (1) Π should preserve a certain set of )) different subspaces of varying dimensions (all depending on on the ranks of A, B) with varying distortions, and (2) for a certain two matrices in our analysis, left-multiplication by Π should not increase their operator norms by more than an O(1) factor. These conditions are chosen carefully so that matrices with subgaussian entries and satisfy all conditions simultaneously with high probability, again thus proving our main result while also suggesting that the conditions we have identified are the “right” ones.

Due to the black box reliance on the subspace embedding primitive in our proofs, Π need not only be a subgaussian map. Thus not only do we improve on m compared with previous work, but also in terms of the general class of Π our result applies to. For example given our first characterization, not only does it suffice to use a random sign matrix with Ω() rows, but in fact one can apply our theorem to more efficient subspace embeddings such as the SRHT or sparse subspace embeddings, or even constructions discovered in the future. That is, one can automatically transfer bounds proven for the subspace embedding property to the ()-AMM property. Thus, for example, the best known SRHT analysis (in our appendix, see Theorem 9) implies (for ) rows. For sparse subspace embeddings, the analysis in [Coh16] implies ) suffices with ) non-zeroes per column of Π. The only reason for the log k loss in m for these particular distributions is not due to our theorems, but rather due to the best analyses for the simpler subspace embedding property in previous work already incurring the extra log k factor (note being a subspace embedding for a k-dimensional subspace is simply a special case of ()-AMM where A = B = U has k orthonormal columns). In the case of the SRHT, this extra log k factor is actually necessary [Tro11]; for sparse subspace embeddings, it is conjectured that the log k factor can be removed and that actually suffices to obtain an OSE [NN13, Conjecture 14]. We also discuss in Remark 3 that one can set Π to be Πhas subgaussian entries with ) rows, and Πother fast OSE (such as the SRHT or sparse subspace embedding), and thus one could obtain the best of both worlds: (1) Π has ) rows, and (2) can be applied to any is the (fast) time to apply Πis the number of rows of ΠFor example, by appropriate composition as discussed in Remark 3, Π can have support multiplying Π

We also observe the proof of the main result of [BSS12] can be modified to show that given any A, B each with n rows, and given any 2), there exists a diagonal matrix Π ) non-zero entries, and that can be computed by a deterministic polynomial time algorithm, achieving ()-AMM. The original work of [BSS12] achieved Eq. (3) with the sum of ranks of A, B. The work [BSS12] stated their result for the case A = B, but the general case of potentially unequal matrices reduces to this case; see Section 4. Our observation also turns out to yield a stronger form of [KMST10, Theorem 3.3]; also see Section 4.

As mentioned, aside from AMM being interesting on its own, it is a useful primitive widely used in analyses of algorithms for several other problems, including k-means clustering [BZMD15, CEM15], nonparametric regression [YPW15], linear least squares regression and low-rank approximation [Sar06], approximating leverage scores [DMMW12], and several other problems (see [Woo14] for a recent summary). For all these, analyses of correctness for algorithms based on dimensionality reduction via some Π rely on Π satisfying AMM for certain matrices in the analysis.

After making certain quantitative improvements to connections between AMM and applications, and combining them with our main result, in Section 3 we obtain the following new results.

1. , consider the problem of computing . It is standard that Moore-Penrose pseudoinverse. The bottleneck here is computing A popular approach is to instead compute ˜, i.e., the minimizer of . Note that computing (Π) only takes a smaller amount of time. We show that if Π satisfies (, and is also an O(1)-subspace embedding for a certain r(A)-dimensional subspace (see Theorem 3), then

where is the orthogonal projection onto the column space of orthonormal columns forming a basis for the column space of A. The punchline is that if the regression error has high actual rank but stable rank only on the order of r(A), then we obtain multiplicative spectral norm error with Π having fewer rows. Generalized regression is a natural extension of the case when B is a vector, and arises for example in Regularized Least Squares Classification, where one has multiple (non-binary) labels, and for each label one creates a column of B; see e.g. [CLL10] for this and variations.

2. Low-rank approximation: We are given and integer 1, and we want to compute . The Eckart-Young theorem implies is obtained by truncating the SVD of A to the top k singular vectors. The standard way to use dimensionality reduction for speedup, introduced in [Sar06], is to let then compute ˜return ˜, the best rank-k approximation of ˜A, instead of (it is known ˜can be computed more efficiently than ; see [CW09, Lemma 4.3]). We show if Π satisfies (for , and is a (1/2)-subspace embedding for the column space of

The punchline is that if the stable rank of the tail is on the same order as the rank parameter k, then standard algorithms from previous work for Frobenius multiplicative error actually in fact also provide spectral multiplicative error. This property indeed holds for any k for popular kernel matrices in machine learning such as the gaussian and Sobolev kernels (see [RHV11] and Examples 2 and 3 of [YPW15]), and low-rank approximation of kernel matrices has been applied to several machine learning problems; see [GM13] for a discussion.

We also explain in Section 3 how our result has already been applied in recent work on dimensionality reduction for k-means clustering [CLM15], and how it generalizes results in [YPW15] on dimensionality reduction for nonparametric regression to use a larger class of embeddings Π.

1.1 Preliminaries and notation

We frequently use the singular value decomposition (SVD). For a matrix consider the compact SVD each have orthonormal columns, and Σis diagonal with strictly positive diagonal entries (the singular values of A). We assume (Σdenote the orthogonal projection operator onto the column space of A. We use span(A) to refer to the subspace spanned by A’s columns.

Often for a matrix as the best rank-k approximation to A under Frobenius or spectral error (obtained by writing the SVD of A then setting all (Σ). We often denote . For matrices with orthonormal columns, such as denotes the matrix formed by removing all but the first k columns of U. When A is understood from context, we often write to denote (

2 Analysis of matrix multiplication for stable rank

First we record a simple lemma relating subspace embeddings and AMM.

Proof. First, without loss of generality we may assume = 1 since we can divide both sides of Eq. (3) by be a matrix whose columns form an orthonormal basis for E. Then note for any x, y we can write

Lemma 1 implies that if A, B each have rank at most r, it suffices for Π to have Ω(In the following two subsections, we give two different characterizations for Π to provide (AMM, both only requiring Π to have Ω() rows, independent of r.

2.1 Characterization for (k, ε, δ)-AMM via a moment property

Here we provide a way to obtain ()-AMM for any Π whose subspace embedding property has been established using the moment method, e.g. sparse subspace embeddings [MM13, NN13, Coh16], dense subgaussian matrices as analyzed in Section A.1, or even the SRHT as analyzed in Section A.2. Our approach in this subsection is inspired by the introduction of the “JL-moment property” in [KN14] to analyze approximate matrix multiplication with Frobenius error. The following is a generalization of [KN14, Definition 6.1], which was only concerned with d = 1.

Definition 3. A distribution )-OSE moments if for all matrices with orthonormal columns,

Note that this is just a special case of bounding the expectation of an arbitrary function of . The arguments below will actually apply to any nonnegative, convex, increasing function of , but we restrict to moments for simplicity of presentation. The acronym “OSE” refers to oblivious subspace embedding, a term coined in [NN13] to refer to distributions over Π yielding a subspace embedding for any fixed subspace of a particular bounded dimension with high probability. We start with a simple lemma.

Lemma 2. Suppose D satisfies the (-OSE moment property and A, B are matrices with (1) the same number of rows, and (2) sum of ranks at most 2d. Then

Proof. First, we apply Lemma 1 to A and B, where U forms an orthonormal basis for the subspace span{columns(A), columns(B)}, showing that

Therefore

Then, just as [KN14, Theorem 6.2] showed that having OSE moments with d = 1 implies approximate matrix multiplication with Frobenius norm error, here we show that having OSE moments for larger d implies approximate matrix multiplication with operator norm error.

be any distribution over matrices with n columns with the (-OSE moment property for some . Then, for any A, B,

Proof. We can assume A, B each have orthogonal columns. This is since, via the full SVD, there exist orthogonal matrices each have orthogonal columns. Since neither left nor right multiplication by an orthogonal matrix changes operator norm,

Thus, we replace and similarly for B. We may also assume the columns A are sorted so that . Henceforth we assume A has orthogonal columns in this sorted order (and similarly for B, with columns ). Now, treat A as a block matrix in which the columns are blocked into groups of size k, and similarly for B (if the number of columns of either A or B is not divisible by k, then pad them with all-zero columns until they are, which does not affect the claim). Let the spectral norm of the ith block of denote the spectral norm of the ith block as . These equalities for A, B hold since their columns are orthogonal and sorted by norm. We claim (and similarly for ). To see this, let the blocks of Also, for i > 1 we have

We now discuss the implications of applying Theorem 1 to specific OSE’s.

Subgaussian maps: In Section A.1 we show that if Π has independent subgaussian entries and ) rows, then it satisfies the ())) OSE moment property. Thus Theorem 1 applies to show that such Π will satisfy (

SRHT: The SRHT is the matrix product Π = diagonal with independent 1 entries on the diagonal, H is a “bounded orthonormal system” (i.e. an orthogonal matrix in )), and the m rows of S are independent and each samples a uniformly random element of [n]. Bounded orthonormal systems include the discrete Fourier matrix and the Hadamard matrix; thus such Π exist supporting matrix-vector multiplication in O(n log n) time. Thus when computing Π, this takes time O(nd log n) (by applying Π to A column by column). In Theorem 9 we show that the SRHT with log(1) satisfies the ())-OSE moment property, and thus provides ()-AMM. Interestingly our analysis of the SRHT in Section A.2 seems to be asymptotically tighter than any other analyses in previous work even for the basic subspace embedding property, and even slightly improves the by now standard analysis of the Fast JL transform given in [AC09].

Sparse subspace embeddings: The sparse embedding distribution with parameters m, s is as follows [CW13, NN13, KN14]. The matrix Π has m rows and n columns. The columns are independent, and for each column exactly s uniformly random entries are chosen without replacement and set to independently; other entries in that column are set to zero. Alternatively, one could use the CountSketch [CCF04]: the m rows are equipartitioned into s sets of size m/s each. The columns are independent, and in each column we pick exactly one row from each of the s partitions and set the corresponding entry in that column to uniformly; the rest of the entries in the column are set to 0. Note ΠA can be multiplied in time )), and thus small s is desirable.

It was shown in [MM13, NN13], slightly improving [CW13], that either of the above distributions satisfies the (2)-OSE moment property for = 1, and hence ((though this particular conclusion follows easily from [KN14, Theorem 6.2]). It was also shown in [Coh16], improving upon [NN13], that they satisfy the ())-OSE moment property, and hence also (conjectured that for ) should suffice [NN13, Conjecture 14].

Remark 2. The work [Coh16] does not explicitly discuss the OSE moment property for sparse subspace embeddings. Rather, [Coh16] bounds ). Note though for 0 and integer by Taylor expansion of the exponential. Setting , [Coh16] thus implies by choice of , which is the ())-OSE moment property.

Remark 3. Currently there appears to be a tradeoff: one can either use Π such that ΠA can be computed quickly, such as sparse subspace embeddings or the SRHT, but then the number of rows m is at least k log k. Alternatively one could achieve the optimal ) using subgaussian Π, but then multiplying by Π is slower: However, settling for a tradeoff is unnecessary. One can actually obtain the “best of both worlds” by composition, i.e. the multiplication Π = Πof two matrices both supporting AMM. Thus Πcould be a fast matrix providing AMM to low (but suboptimal) dimension, and Πa “slow” (e.g. subgaussian) matrix with the optimal ) number of rows. In fact one can even set Π = ΠΠis the sparse subspace embedding with is the SRHT, and Πa subgaussian matrix. Then ΠA will have the desired ) rows and can be computed in time )); see Section A.3 for justification.

2.2 Characterization for (k, ε, )-AMM via deterministic events

Here we provide a different characterization for achieving ()-AMM. Without loss of generality we assume maxeach be minimal such that for some sufficiently large constant (which will be set in the proof of Theorem 2). It was shown that ) in the proof of Theorem 3.2 (i.b) in [MZ11]. Write the SVDs

For 0 as set of all columns of whose corresponding squared singular values (from Σ) are at least 1be the set of min{k, w} largest singular vectors from , and define similarly. Define denote the dimension of span(), and note the are non-decreasing.

Let ˜after rounding up to the nearest power of 2. Group all i with the same value of ˜into groups . For example if for 16 then the ˜are 4be the common value of ˜

denote the dimension of span(). Then the above summation is at most . It thus suffices to bound the second summand by 4k.

Note that we can find a basis for among the columns of with corresponding squared singular value at least 1is the number of columns of the basis and the number of columns of in the basis. Then by averaging, if the inequality of the lemma statement does not hold then either . Without loss of generality assume the former.

Consider an arbitrary column of , and suppose it has squared singular value in the range [1Then it is in span(Its contribution to is therefore 1which is at most 2. It follows that , since the squared Frobenius norm of is at most k. This is a contradiction to Now we prove the main theorem of this subsection.

(1) If -subspace embedding for the subspace spanned by the columns of . Otherwise if , then for each 0 embedding for span(

Proof. We would like to bound

(2). Furthermore by the definition of Θ . Note condition (1) implies that Π is a (1/2)-subspace embedding for the subspace spanned by columns of (by taking i maximal). Thus by both conditions we have

It only remains to bound , then we are done by condition (1) and Lemma 1. Thus assume . Then we have

Let x, y be any unit norm vectors. Write is the restriction of x to coordinates for which the corresponding squared singular values in ΣSimilarly define

We bound the first sum, as bounding the second is similar. Note . Therefore by property (1) and Lemma 1,

where Eq. (8) used that the corresponding v value in property (1) is at most 2. Returning to Eq. (7) and applying Cauchy-Schwarz and Lemma 3,

We thus finally have that Eq. (6) is at most (2is at most sufficiently large constants.

Now we discuss some implications of Theorem 2 for specific Π.

) rows forming an orthonormal basis for the span of the columns of . Property (1) is satisfied for every i in fact with = 0. Property (2) is also satisfied since , and similarly for bounding

Example 2: Let Π be a random matrix with independent entries that are subgaussian with variance 1/m. For example, the entries of Π may be N(0, 1/m), or uniform in ). As mentioned in Section A.1, such Π is an -subspace embedding for a k-dimensional subspace with failure probability . For property (1) of Theorem 2, if we would like Π to be an -subspace embedding for a subspace of dimension at most k, which holds with failure probability then we would like Π to be an -subspace embedding for span() for all 1 ) simultaneously. Note max), and thus max. Thus for a subspace under consideration D, we have failure probability for our choice of m. By construction every is at least increase at least geometrically. Thus our total failure probability is, by a union bound, Property (2) of Theorem 2 is satisfied with failure probability by [RV13, Theorem 3.2].

3 Applications

Spectral norm approximate matrix multiplication with dimension bounds depending on stable rank has immediate applications for the analysis of generalized regression and low-rank approximation problems. We also point out to the reader recent applications of this result to kernelized ridge regression [YPW15] and k-means clustering [CEM

3.1 Generalized regression

Here we consider generalized regression: attempting to approximate a matrix B as AX, with A of rank at most be the orthogonal projection operator to the column space of A, with ; then the natural best approximation will satisfy

This minimizes both the Frobenius and spectral norms of . A standard approximation algorithm for this is to replace A and B with sketches Π, then solve the reduced problem exactly (see e.g. [CW09], Theorem 3.1). This will produce

Below we give a lemma on the guarantees of the sketched solution in terms of properties of Π.

1. satisfies the (-approximate spectral norm matrix multiplication property for

2. is a (1/2)-subspace embedding for the column space of A (which is implied by Π satisfying the spectral norm approximate matrix multiplication property for with itself)

Proof. We may write:

So far, we have shown that the error depends only on (with the third line following from the fact that the sketched regression is exact on ). Now, in the last line, we can see that the two terms lie in orthogonal column spaces (the first in the span of A, the second orthogonal to it). For matrices X and Y with orthogonal column spans, so this is at most

Spectral submultiplicativity then implies the first term is at most

is 1, since is orthonormal. ((Πis at most 2, since Π is a subspace embedding for is at most

3.2 Low-rank approximation

Now we apply the generalized regression result from Section 3.1 to obtain a result on low-rank approximation: approximating a matrix A in the form ˜and both ˜have orthonormal columns. Here, we consider a previous approach (see e.g. [Sar06]):

1. Let

2. Let be the orthogonal projection operator to the row space of

3. Compute a singular value decomposition of ˜A, and keep only the top k singular vectors. Return the resulting low rank approximation ˜

It turns out computing ˜can be done much more quickly than computing ; see details in [CW09, Lemma 4.3].

Let be the exact k-truncated SVD approximation of A (and thus the best rank-k approximation, in the spectral and Frobenius norms), and let be the top k column singular vectors, and be the tail.

1. satisfies the (-approximate spectral norm matrix multiplication property for

2. is a (1/2)-subspace embedding for the column space of

Proof. Note that this procedure chooses the best possible (in the spectral norm) rank-k approximation to A subject to the constraint of lying in the row space of S. Thus, the spectral norm error can be no worse than the error of a specific such matrix we exhibit.

onto

This is rank-k by construction, since it is multiplied by , and it lies in the row space of since that is the rightmost factor. On the other hand, it is an application of the regression algorithm to A where the optimum output is (since that is the projection of A onto the space of Plugging this into Eq. (9) gives the desired result.

3.3 Kernelized ridge regression

In nonparametric regression one is given data , and the goal is to recover a good estimate for the function are scalars, the are vectors, and the are independent noise, often assumed to be distributed as mean-zero gaussian with some variance . Unlike linear regression where ) is assumed to take the form for some vector in nonparametric regression we allow to be an arbitrary function from some function space. Naturally the goal then is to recover some ˜f from the data so that, as n grows, the probability that ˜f is “close” to increases at some good rate.

The recent work [YPW15] considers the well studied problem of obtaining ˜is small with high probability over the noise w, where one uses the definition

The work [YPW15] considers the case where comes from a Hilbert space H of functions f such that f is guaranteed to be square integrable, and the map ) is a bounded linear functional. The function ˜f is then defined to be the optimal solution to the Kernel Ridge Regression (KRR) problem of computing

for some parameter . It is known that any H as above can be written as the closure of the set of all functions

over all and vectors for some positive semidefinite kernel function k. Furthermore, the optimal solution to Eq. (11) can be expressed as ) for some choice of weight vector , and it is known that will be small with high probability, over the randomness in is chosen appropriately (see [YPW15] for background references and precise statements).

After rewriting Eq. (11) using Eq. (12) and defining a matrix arrives at a reformulation for KRR of computing

which can be computed in ) time. The work [YPW15] then focuses on speeding this up, by instead computing a solution to the lower-dimensional problem

and then returning as ˜f the function specified by the weight vector ˜. Note that once various matrix products are formed (where the running time complexity depends on the Π being used), one only needs to invert an matrix thus taking ) time. They then prove that is small with high probability as long as Π satisfies two deterministic conditions (see the proof of Lemma 2 [YPW15, Section 4.1.2], specifically equation (26) in that work):

• Π is a (1/2)-subspace embedding for a particular low-dimensional subspace

) for a particular matrix B of low stable rank (in [YPW15]). Note

and thus it suffices for Π to provide the approximate matrix multiplication property for the product has low stable rank.

The first bullet simply requires a subspace embedding in the standard sense, and for the second bullet [YPW15] avoided AMM by obtaining a bound on directly by their own analyses for gaussian Π and the SRHT (in the gaussian case, it also follows from [RV13, Theorem 3.2]). Our result thus provides a unifying analysis which works for a larger and general class of Π, including for example sparse subspace embeddings.

3.4 k-means clustering

In the works [BZMD15, CEM15], the authors considered dimensionality reduction methods for k-means clustering. Recall in k-means clustering one is given , as well as an integer 1, and the goal is to find

That is, the n points can be partitioned arbitrarily into k clusters, then a “cluster center” should be assigned to each cluster so as to minimize sums of squared Euclidean distances of each of the n points to their cluster centers. It is a standard fact that once a partition n points into clusters is fixed, the optimal cluster centers to choose are the centroids of the points in each of the k partitions, i.e.

One key observation common to both of the works [BZMD15, CEM15] is that k-means clustering is closely related to the problem of low-rank approximation. More specifically, given a partition , define the

Let -means problem can be rewritten as computing

where P ranges over all partitions of {1, . . . , n} into k sets. It is easy to verify that the non-zero columns of are orthonormal, so is the orthogonal projection onto the column space of . Thus if one defines S as the set of all rank at most k orthogonal projections obtained as , then the above can be rewritten as the constrained rank-k projection problem of computing

One can verify this by hand, since the rows of A are the points is the centroid of the points in i’s partition in P.

satisfies certain technical conditions to be divulged soon, then if ˜

then

the best rank-k approximation to , i.e. the top r right singular vectors of A are the columns of

as its first as its lowern rows. Then [CEM15, Lemma 10] states that Eq. (14) implies Eq. (15) as long as

One can easily check , so the stable rank ˜r(B) is at most 3k. ThusEq. (16) is implied by the (32)-AMM property for , and our results apply to show that Π can be taken to have ) rows to have success probability 1for Eq. (16). Obtaining Eq. (17) is much simpler and can be derived from the JL moment property (see the proof of [KN14, Theorem 6.2]).

Without our results on stable-rank AMM provided in this current work, [CEMdifferent analysis, avoiding [CEM15, Lemma 10], which required Π to have rows (note the product between ) instead of the sum).

4 Stable rank and row selection

As well as random projections, approximate matrix multiplication (and subspace embeddings) by row selection are also common in algorithms. This corresponds to setting Π to a diagonal matrix S with relatively few nonzero entries. Unlike random projections, there are no oblivious distributions of such matrices S with universal guarantees. Instead, S must be determined (either randomly or deterministically) from the matrices being embedded.

There are two particularly algorithmically useful methods for obtaining such S. The first is importance sampling: independent random sampling of the rows, but with nonuniform sampling probabilities. This is analyzed using matrix Chernoff bounds [AW02], and for the case of k-dimensional subspace embedding or approximate matrix multiplication of rank-k matrices, it can produce ) samples [SS11]. The second method is the deterministic selection method given in [BSS12], often called “BSS”, choosing only ) rows. This still runs in polynomial time, but originally required many relatively expensive linear algebra steps and thus was slower in general; see [LS15] for runtime improvements.

The matrix Chernoff methods can be extended to the stable-rank case, making even the log factor depend only on the stable rank, using “intrinsic dimension” variants of the bounds as presented in Chapter 7 of [Tro15]. Specifically, Theorem 6.3.1 of that work can be applied with each n summands each equal to is random with the probability of choosing a particular row i equal to

We here give an extension of BSS that covers low stable rank matrices as well.

When is the identity, this is just the original BSS result. It is also stronger than Theorem 3.3 of [KMST10], implying it when A is the combination of the rowsfrom that theorem statement with an extra column containing the costs, and a constant . The techniques in that paper, on the other hand, can prove a result comparable to Theorem 5, but with the row count scaling as rather than

Proof. The proof closely follows the original proof of BSS. However, for simplicity, and because the tight constants are not needed for most applications, we do not include [BSS12, Claim 3.6] and careful parameter-setting.

At each step, the algorithm will maintain a partial approximation ) (the matrix “A” in [BSS12]), with S beginning as 0. Additionally, we keep track of upper and lower “walls” and ; in the original BSS these are just multiples of the identity. The final S will be returned by the algorithm (rescaled by a constant so that the average of the upper and lower walls is

We will maintain the invariants

These are the so-called upper and lower potentials from BSS. We also require is positive definite. Note that unlike [BSS12], here we do not apply a change of variables making the identity (to avoid confusion, since that would change the Frobenius norm). This is the reason for the slightly more complicated form of the potentials.

In the original BSS, were always scalar multiples of the identity (here, without the change of variables, that would correspond to always being multiples of ). [BSS12] thus simply represented them with scalars. Like BSS, we will increase by multiples of the key difference from BSS is that they are initialized to multiples of the identity, rather than . In particular, we may initialize . This is still good enough to get the spectral norm bounds we require here (as opposed to the stronger multiplicative approximation guaranteed by BSS).

We will have two scalar values, , depending only on ; they will be set later. One step consists of

1. Choose a row and a positive scalar (via increasing the i component of S).

2. Add

We will show that with suitable values of obeying the invariants there always exists a choice of i and t such that the invariants will still be true after the step is complete. This corresponds to Lemmas 3.3 through 3.5 of BSS.

For convenience, we define, at a given step, the matrix functions of y

The upper barrier value, after making a step of and increasing

Applying the Sherman-Morrison formula, and cyclicity of trace, to the rank-1 update can be rewritten as

Since the function ) is a convex function of y with derivative

we have ). Then the difference between the barrier before and after the step is at most

Constraining this to be no greater than zero, rewriting in terms of and pulling it out gives

Furthermore, as long as is at least this, Z will remain below , since the barrier must approach infinity as t approaches the smallest value passing For the lower barrier value after the step, we get

Again, applying Sherman-Morrison rewrites it as

Again, due to convexity the increase in the barrier from raising is at most times the local derivative. The difference in the barrier after the step is then at most

This is not greater than zero as long as

the upper bound. To show that there is at least one choice of i for which this holds, we look at the sum of all the lower bounds and compare to the sum of all the upper bounds. Summing the former over all i gets

Finally, note that

and the lower barrier implies , implying that as long as

Thus, we can always make a step as long as are set so that

Before the first step, can be initialized as , respectively. If the algorithm is then run for steps, we have:

) satisfies the requirements of the output for 3(one can simply apply this argument for 3). Furthermore, all the computations required to verify the preservation of invariants and compute explicit ts can be performed in polynomial time.

This obtains more general AMM as a corollary:

Corollary 1. Given two matrices A and B, each with n rows, and an , there exists a diagonal matrix nonzero entries satisfying the (-AMM property for A, B. Such an S can be computed by a polynomial-time algorithm.

Proof. Apply Theorem 5 to a matrix X consisting of the columns of

Note that X satisfies the conditions of that theorem, since concatenating the sets of columns at most adds the squares of their spectral and Frobenius norms. (is a submatrix of 2 max(), so its spectral norm is upper bounded by the spectral norm of that matrix, which in turn is bounded by the guarantee of Theorem 5.

Acknowledgments

We thank Jaros�law B�lasiok for pointing out the connection between low stable rank approximate matrix multiplication and the analyses in [YPW15].

References

[AC09] Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.

[AL13] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013.

[AW02] Rudolf Ahlswede and Andreas J. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

[Bou14] Jean Bourgain. An improved estimate in the restricted isometry problem. Geometric Aspects of Functional Analysis, 2116:65–70, 2014.

[BSS12] Joshua D. Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM J. Comput., 41(6):1704–1721, 2012.

[BZMD15] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2015.

[CCF04] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

[CEMMichael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and M˘ad˘alina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th ACM Symposium on Theory of Computing (STOC), 2015. Full version at http://arxiv.org/abs/1410.6801v3.

[CLLPei-Chun Chen, Kuang-Yao Lee, Tsung-Ju Lee, Yuh-Jye Lee, and Su-Yun Huang. Multiclass support vector classification via coding and regression. Neurocomputing, 73(7-9):1501–1512, 2010.

[CLMMichael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 6th Annual Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.

[Coh16] Michael B. Cohen. Simpler and tighter analysis of sparse oblivious subspace embeddings. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), to appear, 2016.

[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 205–214, 2009. Full version at http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/cw09.pdf.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 81–90, 2013. Full version at http://arxiv.org/abs/1207.6365v4.

[DKM06] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J. Comput., 36(1):132–157, 2006.

[DKS10] Anirban Dasgupta, Ravi Kumar, and Tam´as Sarl´os. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.

[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006.

[DMMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.

[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkh¨auser, 2013.

[GM13] Alex Gittens and Michael W. Mahoney. Revisiting the nystrom method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 567–575, 2013.

[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[HR16] Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier matrices. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), to appear, 2016.

[HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42(3):1079–1083, 1971.

[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[KMN11] Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit johnson-lindenstrauss families. In Proceedings of the 14th International Workshop on Randomization and Computation (RANDOM), pages 628–639, 2011.

[KMST10] Alexandra Kolla, Yury Makarychev, Amin Saberi, and Shang-Hua Teng. Subgraph sparsification and nearly optimal ultrasparsifiers. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC ’10, pages 57–66, New York, NY, USA, 2010. ACM.

[KN14] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4, 2014.

[KVZ14] Anastasios T. Kyrillidis, Michail Vlachos, and Anastasios Zouzias. Approximate matrix multiplication with application to linear embeddings. CoRR, abs/1403.7683, 2014.

[KW11] Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property. SIAM J. Math. Anal., 43(3):1269–1281, 2011.

[LBKW14] Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. Woodruff. Improved distributed principal component analysis. In Proceedings of the 27th Annual Conference on Advances in Neural Information Processing Systems (NIPS), pages 3113–3121, 2014.

[LDFU13] Yichao Lu, Paramveer Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Proceedings of the 26th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2013.

[LMP13] Mu Li, Gary L. Miller, and Richard Peng. Iterative row sampling. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 127–136, 2013.

[LP86] Fran¸cois Lust-Piquard. In´egalit´es de Khintchine dans Acad. Sci. Paris, 303(7):289–292, 1986.

[LPP91] Fran¸cois Lust-Piquard and Gilles Pisier. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

[LS15] Yin Tat Lee and He Sun. Constructing linear sized spectral sparsification in almost linear time. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 250–269, 2015.

[LWMEdo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.

[Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[MM13] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 91–100, 2013.

[MZ11] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1422–1436, 2011.

[NN13] Jelani Nelson and Huy L. Nguy˜ˆen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 117–126, 2013.

[NN14] Jelani Nelson and Huy L. Nguy˜ˆen. Lower bounds for oblivious subspace embeddings. In Proceedings of the 41st International Colloquium on Automata, Languages, and Programming (ICALP), pages 883–894, 2014.

[NPW14] Jelani Nelson, Eric Price, and Mary Wootters. New constructions of RIP matrices with fast multiplication and fewer rows. In Proceedings of the 25th Annual ACMSIAM Symposium on Discrete Algorithms (SODA), 2014.

[RHV11] Nima Reyhani, Hideitsu Hino, and Ricardo Vig´ario. New probabilistic bounds on eigenvalues and eigenvectors of random kernel matrices. In Proceedings of the TwentySeventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 627–634, 2011.

[RV13] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18:1–9, 2013.

[Sar06] Tam´as Sarl´os. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006.

[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.

[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 3(1–2):115–126, 2011.

[Tro15] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[TZ12] Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293– 331, 2012.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[YPW15] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. CoRR, abs/1501.06195, 2015.

Appendix A OSE moment property

In the following two subsections we show the OSE moment property for both subgaussian matrices and the SRHT.

In this section, we show the OSE moment property for distributions satisfying a JL condition, namely the JL moment property. This includes matrices with i.i.d. entries that are mean zero and subgaussian with variance 1/m.

Definition 4. [KMN11] Let D be a distribution over property if for all of unit norm,

The following theorem follows from the proof of Lemma 8 in the full version of [CW13]. We give a different proof here inspired by the proof of [FR13, Theorem 9.9], which is slightly shorter and more self-contained. A weaker version appears in [Sar06, Lemma 10], where the size bound on for a constant 1 instead of simply

with orthonormal columns be arbitrary. Then there exists a set , each of norm at most 1 such that

Proof. We will show that if supis some positive real. Define is symmetric,

Let be a finite and for every of unit norm there exists a such that . As we will see soon, there exists such a of size at most (1 + 2will show that if Π satisfies the JL condition on with error that is, (1

Let x be a unit norm vector that achieves the sup above, i.e. . Then, letting y be the closest element of

Rearranging gives )), which is

Now we must show that we can take . The following is a standard covering/packing argument for bounding metric entropy. Imagine packing as many radius-(as possible into , centered at points with at most unit norm and such that these balls do not intersect each other. Then these balls all fit into a radius-(1 + ball centered at the origin, and thus the number of balls we have packed is at most the ratio of the volume of a (1 + to the volume of a 2 ball, which is ((1 + . Now, take those maximally packed radius-(2) balls and double each of their radii to be radius . Then every point in the unit ball is contained in at least one of these balls by the triangle inequality, which is exactly the property we wanted from is just the centers of these balls). To see why every point is in at least one such ball, if some of unit norm is not contained in any doubled ball then a 2-ball about x would be disjoint from our maximally packed 2 balls, a contradiction.

Lemma 4. If D satisfies the (-JL moment property, then D satisfies the (2moment property

Proof. By Theorem 6, there exists a subset of at most 9points such that

It is known that if D is a distribution over ) and for Π the entries of Π are independent subgaussians with mean zero and variance 1/m, then D has the ()))-JL moment property [KMN11]. Thus such a matrix has the (log(1)))-OSE moment property for by Lemma 4.

Recall the SRHT is the matrix Π = (1a power of 2 where D has diagonal entries that are independent and uniform in is the unnormalized Hadamard transform with as elements of the vector space sampling matrix. That is, the rows of S are independent, and each row has a 1 in a uniformly random location and zeroes elsewhere. A similar construction is where matrix with being independent Bernoulli random variables each of expectation m/n (so that, in expectation, S selects m rows from HD). We will here show the moment property for this latter variant since it makes the notation a tad cleaner, though the analysis we present holds essentially unmodified for the former variant as well.

Our analysis below implies that the SRHT provides an -subspace embedding for d-dimensional subspaces with failure probability )). This is an improvement over analyses we have found in previous works. The analysis in [Tro11] only considers constant ) and for these settings achieves m = O((d+log n) log d), which is still slightly worse than our bound for this setting of (our bound removes the log n and achieves any 1/ poly(d) failure probability with the same m). The analysis in [LDFU13] only allows failure probabilities greater than . They show failure probability is achieved for which is also implied by our result if (which is certainly the case in applications for the SRHT to be useful, since otherwise one could use the identity matrix as a subspace embedding). The reason for these differences is that previous works operate by showing HDU has small row norms with high probability over D; since there are n rows, some logarithmic dependence on n shows up in a union bound. After this conditioning, one then shows that S works. Our analysis does not do any such conditioning at all. Interestingly, such a lossy conditioning approach was done even for the case d = 1 [AC09]. As we see below, these analyses can be improved (essentially the log n terms that appear from the conditioning approach can be very slightly improved to log m).

Our main motivation in re-analyzing the SRHT was not to improve the bounds, but simply to clearly demonstrate that the SRHT satisfies the OSE moment property. The fact that our moment based analysis below (very slightly) improved m was a fortunate accident. Before we present our proof of the OSE moment property for the SRHT, we state a theorem we will use. For a random matrix M, we henceforth use to denote (is the Schatten-p norm, i.e. the norm of the singular values of M.

Theorem 7 (Non-commutative Khintchine inequality [LP86, LPP91])be fixed real matrices and be independent Rademachers. Then

We will also make use of the Hanson-Wright inequality.

We now present our main analysis of this subsection.

Theorem 9. The SRHT satisfies the (-moment property for as long as

Proof. For a fixed with orthonormal columns, we would like to bound

Since

by H¨older’s inequality. Also, let be the rows of HDU, as column vectors, so that

Note also , so the identity matrix is the expectation, over , of the right hand side of Eq. (21) for any D. Thus we are left wanting to bound

where the are identically distributed as the but independent of them. Below we use to denote (. Also we assume p is an integer multiple of 4, so that for real symmetric . Thus for () independent Rademachers,

implying that for some fixed constant C > 0, we have 0. This implies that Q is at most the larger root of the associated quadratic equation, i.e. , or equivalently

where ˜is the matrix with ( ˜. Of particular importance for us is the identity ˜. Then by Eq. (26) and Theorem 8,

so that

which when combined with Eq. (25) gives

Thus the OSE moment property is satisfied by our choices of m, p in the theorem statement.

As discussed in Remark 3, to obtain both a good number of rows for Π as well as fast multiplication for Π, one may wish to set Π as the composition Π = Πhas the correct number ) of rows (e.g. a matrix of subgaussian entries), whereas Πmaps to a small but suboptimal number of rows (e.g. the SRHT) but supports fast embedding to compute Πshow here that composing maps each supporting AMM yields a final map also giving AMM.

As discussed in Corollary 1, without loss of generality we can assume A = B. Also, as discussed in Remark 1, we can focus on achieving Eq. (3) where the number of rows of Π should depend on the stable rank ˜r and not rank r of A. The key is to note the following simple triangle inequality:

The results of this work show that to achieve the desired , it suffices that the number of rows of Πneed only depend on ˜r and not r, as desired. The trouble is that for , the number of rows of Πwill need to depend on the stable rank ˜. Furthermore, the error will be . Thus, we must obtain good bounds on both ˜To achieve this, note

and

above), then indeed we have ) by Eq. (28). Also, [KN14, Theorem 6.2] implies with probability 1 as long as Π comes from a distribution satisfying the ()-JL moment property for some 2 (which is just the (moment property in the terminology of this work). If this holds, then Eq. (29), and thus ˜), as desired. Then overall, we have that the left hand side of Eq. (27) is at most as desired, in which both Πneed only provide AMM with error for matrices both of stable rank

Remark 4. An even slicker argument that works in the case when Πare both drawn from distributions satisfying the ()-OSE moment property is to observe that the distribution of the product Πitself satisfies the OSE moment property. Indeed, letting for a scalar random variable Z, and letting denote a matrix with orthonormal columns,

In the first line we used that when A = B in Lemma 2, Πneed only satisfy the OSE moment property with parameter k instead of 2k (since then the span of the columns of both A and B has dimension at most k). Thus the distribution of the product Πsatisfies the (moment property.

Designed for Accessibility and to further Open Science