b

DiscoverSearch
About
Optimal approximate matrix product in terms of stable rank
2015·arXiv
Abstract
Abstract

We give two different characterizations of the type of dimensionality-reducing map Π that can be used for spectral error approximate matrix multiplication (AMM). Both imply a random data-oblivious Π with  m = O(˜r/ε2) rows suffices, where ˜r is the maximum stable rank, i.e. squared ratio of Frobenius and operator norms, of the matrices being multiplied. This answers the main open question of [MZ11, KVZ14], and is optimal for any random oblivious map.

Both characterizations apply to a general class of random Π, and one even to deterministic Π. Recall an (ε, δ, d)-oblivious subspace embedding (OSE)distribution D over matrices Π  ∈ Rm×nis such that for any d-dimensional linear subspace  E of Rn, P(∥(ΠU)T (ΠU)−I∥ > ε) < δ, wherethe columns of U form an orthonormal basis for E. In one characterization, we show if this tail bound was established via the moment method, then to obtain AMM it suffices that D be an (ε, δ, 2˜r)-OSE. That is, we show being an OSE for dimension (i.e. rank) k implies black box AMM for matrices of stable rank k. Once this is shown, our main result is then just a simple corollary of the fact that subgaussian maps with  m = Ω((d + log(1/δ))/ε2) rows are (ε, δ, d)-OSE’s. Also, for all known OSE’s, the best analyses indeed are via the moment method (or tools such as matrix Chernoff, which themselves also imply moment bounds). Thus our theorem can be applied to a much more general class of sketching matrices than just the subgaussian sketches in [MZ11, KVZ14], in addition to achieving better bounds. This includes fast subspace embeddings such as the Subsampled Randomized Hadamard Transform [Sar06, LWM+07] or sparse subspace embeddings [CW13, MM13, NN13, Coh16], or even constructions that may be developed in the future, to show that rank bounds in their analyses in previous work can automatically be replaced with stable rank. Our second characterization identifies certain deterministic conditions which if satisfied imply the AMM guarantee. We show these conditions are sufficiently precise to yield optimal results for subgaussian maps and even a deterministic Π such as the truncated SVD.

Our main theorem, via connections with spectral error matrix multiplication proven in previous work, implies quantitative improvements for approximate least squares regression and low rank approximation [Sar06], and implies faster low rank approximation for popular kernels in machine learning such as the gaussian and Sobolev kernels. Our main result has also already been applied to improve dimensionality reduction guarantees for k-means clustering [CEM+15], andalso implies new results for nonparametric regression when combined with results in [YPW15].

Lastly, we point out a minor but interesting observation that the proof of the “BSS” deterministic row-sampling result of [BSS12] can be modified to show that for any matrices A, B of stable rank at most ˜r, one can achieve the spectral norm guarantee for approximate matrix multiplication of  AT Busing a deterministic sampling matrix with  O(˜r/ε2) non-zero entries which can be found in polynomial time. The original result of [BSS12] was for rank instead of stable rank. Our observation leads to a stronger version of a main theorem of [KMST10].

Much recent work has successfully utilized randomized dimensionality reduction techniques to speed up solutions to linear algebra problems, with applications in machine learning, statistics, optimization, and several other domains; see the recent monographs [HMT11, Mah11, Woo14] for more details. In our work here, we give new spectral norm guarantees for approximate matrix multiplication (AMM). Aside from AMM being interesting in its own right, it has become a useful primitive in the literature for analyzing algorithms for other large-scale linear algebra problems as well. We show applications of our new guarantees to speeding up standard algorithms for generalized regression and low-rank approximation problems. We also describe applications of our results to k-means clustering (discovered in [CEM+15]) and nonparametric regression [YPW15].

In AMM we are given A, B each with a large number of rows n, and the goal is to compute some matrix  C such that ∥C − AT B∥Xis “small”, for some norm  ∥ · ∥X. Furthermore, we would like to compute C much faster than the usual time required to exactly compute  AT B.

Work on randomized methods for AMM began with [DKM06], which focused on  ∥· ∥X = ∥· ∥F ,i.e., Frobenius norm error. They showed by picking an appropriate sampling matrix Π  ∈ Rm×n,

image

with good probability, if  m = Ω(1/ε2). By a sampling matrix, we mean the rows of Π are independent, and each row is all zero except for a 1 in a random location according to some appropriate distribution. If  A ∈ Rn×d and B ∈ Rn×p, note (ΠA)T (ΠB) can be computed in O(mdp) time once ΠA and ΠBare formed, as opposed to the straightforward O(ndp) time to compute  AT B.

The Frobenius norm error guarantee of Eq. (1) was also later achieved in [Sar06, Lemma 6] via a different approach, with some later optimizations to the parameters in [KN14, Theorem 6.2]. The approach of Sarl´os was not via sampling, but rather to use a matrix Π drawn from a distribution satisfying an “oblivious Johnson-Lindenstrauss (JL)” guarantee, i.e. a distribution  D over Rm×nsatisfying the following condition for some  ε, δ ∈ (0, 1/2):

image

Such a matrix Π can be taken with  m = O(ε−2 log(1/δ)) [JL84]. Furthermore, one can take Π to be a Fast JL transform [AC09] (or any of the follow-up improvements [AL13, KW11, NPW14, Bou14, HR16]) or a sparse JL transform [DKS10, KN14] to speed up the computation of ΠA and ΠB. One could also use the Thorup-Zhang sketch [TZ12] combined with a certain technique of [LBKW14] (see [Woo14, Theorem 2.10] for details) to efficiently boost success probability.

Other than Frobenius norm error, the main other error guarantee investigated in previous work is spectral error. That is, we would like  ∥C − AT B∥to be small, where  ∥M∥denotes the largest singular value of M. If one is interested in applying  AT Bto some set of input vectors then this type of error is the most meaningful, since  ∥C − AT B∥being small is equivalent to  ∥Cx∥ ≈ ∥AT Bx∥for any x. The first work along these lines was again by [DKM06], who gave a procedure based on entry-wise sampling of the entries of A and B. The works [DMM06, SS11] showed that row-sampling according to leverage scores also provides the desired guarantee with few samples.

Then [Sar06], combined with a quantitative improvement in [CW13], showed that one can take a Π drawn from an oblivious JL distribution with  δ = 2−Θ(r) where r(·) denotes rank and r = r(A)+r(B). Then for Π with  m = O((r +log(1/δ))/ε2), with probability at least 1−δ over Π,

image

As we shall see shortly via a very simple lemma (Lemma 1), a sufficient deterministic condition implying Eq. (3) is that Π is an  O(ε)-subspace embedding for the r-dimensional subspace spanned by the columns of A, B. The notion of a subspace embedding was introduced by [Sar06].

Definition 1. Π is an ε-subspace embedding for  U ∈ Rn×r, U T U = I, if Πsatisfies Eq. (3) with A = B = U, i.e. ∥(ΠU)T (ΠU) − I∥ ≤ ε. This is equivalent to  ∀x ∈ Rr, (1 − ε)∥x∥22 ≤ ∥ΠUx∥22 ≤(1 +  ε)∥x∥22, i.e. Πpreserves norms of all vectors in the subspace spanned by the columns U.

An (ε, δ, r)-oblivious subspace embedding (OSE)  is a distribution D over Rm×n such that

image

Fast subspace embeddings Π, i.e. such that the products ΠA and ΠBcan be computed quickly, are known using variants on the Fast JL transform such as the Subsampled Randomized Hadamard Transform (SRHT) [Sar06, LWM+07, Tro11, LDFU13] (also see a slightly improved analysis of the SRHT in Section A.2) or via sparse subspace embeddings [CW13, MM13, NN13, LMP13, CLM+15,Coh16]. In most applications it is important to have a fast subspace embedding to shrink the time it takes to transform the input data to a lower-dimensional form. The SRHT is a construction of a Π such that ΠA can be computed in time O(nd log n) (see Section A.2 for details of the construction). The sparse subspace embedding constructions have some parameter m rows and exactly s non-zero entries per column, so that ΠA can be computed in time  O(s·nnz(A)), where nnz(·) is the number of non-zero entries, and there is a tradeoff in the upper bounds between m and s.

An issue addressed by the work of [MZ11] is that of robustness. As stated above, achieving Eq. (3) requires Π be a subspace embedding for an r-dimensional subspace. However, consider the case when A (and similarly for B) is of high rank but can be expressed as the sum of a low-rank matrix plus high-rank noise of small magnitude, i.e.,  A = ˜A + EA for ˜A of rank r( ˜A) ≪ r, andwhere  ∥EA∥is very small but  EAhas high (even full) rank. One would hope the noise could be ignored, but standard results require Π to have a number of rows at least as large as r, regardless of how small the magnitude of the noise is. Another case of interest (as we will see in Section 3) is when A and B are each of high rank, but their singular values decay at some appropriate rate. As discussed in Section 3, in several applications where AMM is not the final goal but rather is used as a primitive in analyzing an algorithm for some other problem (such as k-means clustering or nonparametric regression), the matrices that arise do indeed have such decaying singular values.

The work [MZ11] remedied this by considering the  stable ranks ˜r(A), ˜r(B) of A and B. Define˜r(A) = ∥A∥2F /∥A∥2. Note ˜r(A) ≤ r(A) always, but can be much less if A has a small tail of singular values. Let ˜r denote ˜r(A)+ ˜r(B). Among other results, [MZ11] showed that to achieve Eq. (3) with good probability, one can take Π to be a random (scaled) sign matrix with either  m = Ω(˜r/ε4) orm = Ω(˜r log(d + p)/ε2) rows. As noted in follow-up work [KVZ14], both the 1/ε4 dependence and the log(d + p) factor are undesirable. In their data-driven low dimensional embedding application, they wanted a dimension m independent of the original dimensions, which are assumed much larger than the stable rank, and also wanted lower dependence on 1/ε. To this end, [KVZ14] defined the nuclear rank as ˜nr(A) = ∥A∥∗/∥A∥and showed  m = Ω( ˜nr/ε2) rows suffice for ˜nr = ˜nr(A)+ ˜nr(B).Here  ∥A∥∗is the nuclear norm, i.e., sum of singular values of  A. Since ∥A∥2F is the sum of squared singular values, it is straightforward to see that ˜nr(A) ≥ ˜r(A) always. Thus there is a tradeoff: the stable rank guarantee is worsened to nuclear rank, but dependence on 1/εis improved to quadratic.

We show switching to the weaker ˜nr guarantee is unnecessary by showing quadratic dependence on 1/εholds even with stable rank. This answers the main open question of [MZ11, KVZ14].

To state our results in a more natural way, we rephrase our main result to say that we achieve

image

for an arbitrary  k ≥1, and we do so by using subspace embeddings for O(k)-dimensional subspaces in a certain black box way (which will be made precise soon) regardless of the ranks of A, B.

Remark 1. Note that our previously stated main contribution is equivalent, since one could set  k = ˜r(A) + ˜r(B) to arrive at the conclusion that subspace embeddings for  O(˜r)-dimensional subspaces yield the guarantee in Eq. (3). Alternatively one could obtain the Eq. (4) guarantee via Eq. (3) with error parameter  ε′ = Θ(ε · min{1,�(˜r(A) · ˜r(B))/k}).

Henceforth, we use the following definition.

Definition 2. For conforming matrices  AT , B, we say Πsatisfies the (k, ε)-approximate spectral norm matrix multiplication property ((k, ε)-AMM) for A, B if Eq. (4) holds. If Πis random and satisfies (k, ε)-AMM with probability 1−δfor any fixed A, B, then we say Π satisfies (k, ε, δ)-AMM.

Our main contribution: We give two different characterizations for Π supporting (k, ε)-AMM,both of which imply (k, ε, δ)-AMM Π having  m = O((k + log(1/δ))/ε2) rows.The first characterization applies to any OSE distribution for which a moment bound has been proven for ∥(ΠU)T (ΠU) − I∥(which is true for the best analyses of all known OSE’s). In this case, we show a black box theorem: any (ε, δ, 2k)-OSE provides (k, ε, δ)-AMM. Since matrices with subgaussian entries and  m = Ω((k + log(1/δ))/ε2) are (ε, δ, 2k)-OSE’s, our originally stated main result follows. This result is optimal, since [NN14] shows any randomized distribution over Π with m rows having the (k, ε, δ)-AMM property must have  m = Ω((k + log(1/δ))/ε2) (the hard instance there is when A = B = U has orthonormal columns, and thus rank and stable rank are equal).

Our second characterization identifies certain deterministic conditions which, if satisfied by Π, imply the desired (k, ε)-AMM property. These conditions are of the form: (1) Π should preserve a certain set of  O(log(1/ε)) different subspaces of varying dimensions (all depending on  k, ε and noton the ranks of A, B) with varying distortions, and (2) for a certain two matrices in our analysis, left-multiplication by Π should not increase their operator norms by more than an O(1) factor. These conditions are chosen carefully so that matrices with subgaussian entries and  m = Ω(k/ε2)satisfy all conditions simultaneously with high probability, again thus proving our main result while also suggesting that the conditions we have identified are the “right” ones.

Due to the black box reliance on the subspace embedding primitive in our proofs, Π need not only be a subgaussian map. Thus not only do we improve on m compared with previous work, but also in terms of the general class of Π our result applies to. For example given our first characterization, not only does it suffice to use a random sign matrix with Ω(k/ε2) rows, but in fact one can apply our theorem to more efficient subspace embeddings such as the SRHT or sparse subspace embeddings, or even constructions discovered in the future. That is, one can automatically transfer bounds proven for the subspace embedding property to the (k, ε)-AMM property. Thus, for example, the best known SRHT analysis (in our appendix, see Theorem 9) implies (k, ε, δ)-AMMfor  m = Ω((k + log(1/(εδ)) log(k/δ))/ε2) rows. For sparse subspace embeddings, the analysis in [Coh16] implies  m = Ω(k log(k/δ)/ε2) suffices with  s = O(log(k/δ)/ε) non-zeroes per column of Π. The only reason for the log k loss in m for these particular distributions is not due to our theorems, but rather due to the best analyses for the simpler subspace embedding property in previous work already incurring the extra log k factor (note being a subspace embedding for a k-dimensional subspace is simply a special case of (k, ε)-AMM where A = B = U has k orthonormal columns). In the case of the SRHT, this extra log k factor is actually necessary [Tro11]; for sparse subspace embeddings, it is conjectured that the log k factor can be removed and that  m = Ω((k+log(1/δ))/ε2)actually suffices to obtain an OSE [NN13, Conjecture 14]. We also discuss in Remark 3 that one can set Π to be Π1 · Π2 where Π1has subgaussian entries with  O(k/ε2) rows, and Π2 is someother fast OSE (such as the SRHT or sparse subspace embedding), and thus one could obtain the best of both worlds: (1) Π has  O(k/ε2) rows, and (2) can be applied to any  A ∈ Rn×d in timeT + O(km′d/ε2), where Tis the (fast) time to apply Π2 to A, and m′ is the number of rows of Π2.For example, by appropriate composition as discussed in Remark 3, Π can have  O(k/ε2) rows andsupport multiplying ΠA for A ∈ Rn×d in time O(nnz(A)) + ˜O(ε−O(1)(k3 + k2d)).

We also observe the proof of the main result of [BSS12] can be modified to show that given any A, B each with n rows, and given any  ε ∈ (0, 1/2), there exists a diagonal matrix Π  ∈ Rn×n withO(k/ε2) non-zero entries, and that can be computed by a deterministic polynomial time algorithm, achieving (k, ε)-AMM. The original work of [BSS12] achieved Eq. (3) with  m = O(r/ε2) for r beingthe sum of ranks of A, B. The work [BSS12] stated their result for the case A = B, but the general case of potentially unequal matrices reduces to this case; see Section 4. Our observation also turns out to yield a stronger form of [KMST10, Theorem 3.3]; also see Section 4.

As mentioned, aside from AMM being interesting on its own, it is a useful primitive widely used in analyses of algorithms for several other problems, including k-means clustering [BZMD15, CEM+15], nonparametric regression [YPW15], linear least squares regression and low-rank approximation [Sar06], approximating leverage scores [DMMW12], and several other problems (see [Woo14] for a recent summary). For all these, analyses of correctness for algorithms based on dimensionality reduction via some Π rely on Π satisfying AMM for certain matrices in the analysis.

After making certain quantitative improvements to connections between AMM and applications, and combining them with our main result, in Section 3 we obtain the following new results.

1.  Generalized regression: Given A ∈ Rn×d and B ∈ Rn×p, consider the problem of computing  X∗ = argminX∈Rd×p ∥AX − B∥. It is standard that  X∗ = (AT A)+AT B where (·)+ is theMoore-Penrose pseudoinverse. The bottleneck here is computing  AT A, taking O(nd2) time.A popular approach is to instead compute ˜X = ((ΠA)T (ΠA))+(ΠA)T ΠB, i.e., the minimizer of  ∥ΠAX − ΠB∥. Note that computing (ΠA)T (ΠA) (given ΠA) only takes a smaller  O(md2)amount of time. We show that if Π satisfies (k, O(√ε))-AMM for UA, P ¯AB, and is also an O(1)-subspace embedding for a certain r(A)-dimensional subspace (see Theorem 3), then

image

where  PAis the orthogonal projection onto the column space of  A, P ¯A = I − PA, and UA hasorthonormal columns forming a basis for the column space of A. The punchline is that if the regression error  P ¯ABhas high actual rank but stable rank only on the order of r(A), then we obtain multiplicative spectral norm error with Π having fewer rows. Generalized regression is a natural extension of the case when B is a vector, and arises for example in Regularized Least Squares Classification, where one has multiple (non-binary) labels, and for each label one creates a column of B; see e.g. [CLL+10] for this and variations.

2. Low-rank approximation: We are given  A ∈ Rn×d and integer  k ≥1, and we want to compute  Ak = argminr(X)≤k ∥A − X∥. The Eckart-Young theorem implies  Akis obtained by truncating the SVD of A to the top k singular vectors. The standard way to use dimensionality reduction for speedup, introduced in [Sar06], is to let  S = ΠAthen compute ˜A = APS. Thenreturn ˜Ak, the best rank-k approximation of ˜A, instead of  Ak(it is known ˜Akcan be computed more efficiently than  Ak; see [CW09, Lemma 4.3]). We show if Π satisfies (k, O(√ε))-AMMfor  Uk and A − Ak, and is a (1/2)-subspace embedding for the column space of  Ak, then

image

The punchline is that if the stable rank of the tail  A − Akis on the same order as the rank parameter k, then standard algorithms from previous work for Frobenius multiplicative error actually in fact also provide spectral multiplicative error. This property indeed holds for any k for popular kernel matrices in machine learning such as the gaussian and Sobolev kernels (see [RHV11] and Examples 2 and 3 of [YPW15]), and low-rank approximation of kernel matrices has been applied to several machine learning problems; see [GM13] for a discussion.

We also explain in Section 3 how our result has already been applied in recent work on dimensionality reduction for k-means clustering [CLM+15], and how it generalizes results in [YPW15] on dimensionality reduction for nonparametric regression to use a larger class of embeddings Π.

1.1 Preliminaries and notation

We frequently use the singular value decomposition (SVD). For a matrix  A ∈ Rn×d of rank r,consider the compact SVD  A = UAΣAV TA where UA ∈ Rn×r and VA ∈ Rd×reach have orthonormal columns, and ΣAis diagonal with strictly positive diagonal entries (the singular values of A). We assume (ΣA)i,i ≥ (ΣA)j,j for i < j. We let PA = UAU TA denote the orthogonal projection operator onto the column space of A. We use span(A) to refer to the subspace spanned by A’s columns.

Often for a matrix  A we write Akas the best rank-k approximation to A under Frobenius or spectral error (obtained by writing the SVD of A then setting all (ΣA)i,i to 0 for i > k). We often denote  A−Ak as A¯k. For matrices with orthonormal columns, such as  UA, (UA)kdenotes the  n×kmatrix formed by removing all but the first k columns of U. When A is understood from context, we often write  UΣV T instead of UAΣAV TA , and Ukto denote (UA)k (and Σk for (ΣA)k, etc.).

First we record a simple lemma relating subspace embeddings and AMM.

image

Proof. First, without loss of generality we may assume  ∥A∥ = ∥B∥= 1 since we can divide both sides of Eq. (3) by  ∥A∥ · ∥B∥. Let Ube a matrix whose columns form an orthonormal basis for E. Then note for any x, y we can write  Ax = Uw, By = Uz where ∥w∥ ≤ ∥x∥, ∥z∥ ≤ ∥y∥. Then

image

image

Lemma 1 implies that if A, B each have rank at most r, it suffices for Π to have Ω(r/ε2) rows.In the following two subsections, we give two different characterizations for Π to provide (k, ε)-AMM, both only requiring Π to have Ω(k/ε2) rows, independent of r.

2.1 Characterization for (k, ε, δ)-AMM via a moment property

Here we provide a way to obtain (k, ε)-AMM for any Π whose subspace embedding property has been established using the moment method, e.g. sparse subspace embeddings [MM13, NN13, Coh16], dense subgaussian matrices as analyzed in Section A.1, or even the SRHT as analyzed in Section A.2. Our approach in this subsection is inspired by the introduction of the “JL-moment property” in [KN14] to analyze approximate matrix multiplication with Frobenius error. The following is a generalization of [KN14, Definition 6.1], which was only concerned with d = 1.

Definition 3. A distribution  D over Rm×n has (ε, δ, d, ℓ)-OSE moments if for all matrices  U ∈Rn×d with orthonormal columns,

image

Note that this is just a special case of bounding the expectation of an arbitrary function of ∥(ΠU)T (ΠU) − I∥. The arguments below will actually apply to any nonnegative, convex, increasing function of  ∥(ΠU)T (ΠU) − I∥2, but we restrict to moments for simplicity of presentation. The acronym “OSE” refers to oblivious subspace embedding, a term coined in [NN13] to refer to distributions over Π yielding a subspace embedding for any fixed subspace of a particular bounded dimension with high probability. We start with a simple lemma.

Lemma 2. Suppose D satisfies the (ε, δ, 2d, ℓ)-OSE moment property and A, B are matrices with (1) the same number of rows, and (2) sum of ranks at most 2d. Then

image

Proof. First, we apply Lemma 1 to A and B, where U forms an orthonormal basis for the subspace span{columns(A), columns(B)}, showing that

image

Therefore

image

Then, just as [KN14, Theorem 6.2] showed that having OSE moments with d = 1 implies approximate matrix multiplication with Frobenius norm error, here we show that having OSE moments for larger d implies approximate matrix multiplication with operator norm error.

Theorem 1. Given k, ε, δ ∈ (0, 1/2), let Dbe any distribution over matrices with n columns with the (ε, δ, 2k, ℓ)-OSE moment property for some  ℓ ≥ 2. Then, for any A, B,

image

Proof. We can assume A, B each have orthogonal columns. This is since, via the full SVD, there exist orthogonal matrices  RA, RB such that ARA and BRBeach have orthogonal columns. Since neither left nor right multiplication by an orthogonal matrix changes operator norm,

image

Thus, we replace  A by ARAand similarly for B. We may also assume the columns  a1, a2, . . . ofA are sorted so that  ∥ai∥2 ≥ ∥ai+1∥2 for all i. Henceforth we assume A has orthogonal columns in this sorted order (and similarly for B, with columns  bi). Now, treat A as a block matrix in which the columns are blocked into groups of size k, and similarly for B (if the number of columns of either A or B is not divisible by k, then pad them with all-zero columns until they are, which does not affect the claim). Let the spectral norm of the ith block of  A be si = ∥a(i−1)·k+1∥2, and for Bdenote the spectral norm of the ith block as  ti = ∥b(i−1)·k+1∥2. These equalities for A, B hold since their columns are orthogonal and sorted by norm. We claim �i s2i ≤ ∥A∥2 +∥A∥2F /k(and similarly for �i t2i ). To see this, let the blocks of  A be A′1, . . . , A′q where si = ∥A′i∥. Note s21 = ∥A′1∥ ≤ ∥A∥.Also, for i > 1 we have

image

image

We now discuss the implications of applying Theorem 1 to specific OSE’s.

Subgaussian maps: In Section A.1 we show that if Π has independent subgaussian entries and m = Ω((k+log(1/δ))/ε2) rows, then it satisfies the (ε, δ, 2k, Θ(k+log(1/δ))) OSE moment property. Thus Theorem 1 applies to show that such Π will satisfy (k, ε, δ)-AMM.

SRHT: The SRHT is the matrix product Π =  SHD where D ∈ Rn×n is n × ndiagonal with independent  ±1 entries on the diagonal, H is a “bounded orthonormal system” (i.e. an orthogonal matrix in  Rn×n with maxi,j |Hi,j| = O(1/√n)), and the m rows of S are independent and each samples a uniformly random element of [n]. Bounded orthonormal systems include the discrete Fourier matrix and the Hadamard matrix; thus such Π exist supporting matrix-vector multiplication in O(n log n) time. Thus when computing ΠA for some n × d matrix A, this takes time O(nd log n) (by applying Π to A column by column). In Theorem 9 we show that the SRHT with  m = Ω((k +log(1/(εδ)) log(k/δ))/ε2) satisfies the (ε, δ, 2k, log(k/δ))-OSE moment property, and thus provides (k, ε, δ)-AMM. Interestingly our analysis of the SRHT in Section A.2 seems to be asymptotically tighter than any other analyses in previous work even for the basic subspace embedding property, and even slightly improves the by now standard analysis of the Fast JL transform given in [AC09].

Sparse subspace embeddings: The sparse embedding distribution with parameters m, s is as follows [CW13, NN13, KN14]. The matrix Π has m rows and n columns. The columns are independent, and for each column exactly s uniformly random entries are chosen without replacement and set to  ±1/√sindependently; other entries in that column are set to zero. Alternatively, one could use the CountSketch [CCF04]: the m rows are equipartitioned into s sets of size m/s each. The columns are independent, and in each column we pick exactly one row from each of the s partitions and set the corresponding entry in that column to  ±1/√suniformly; the rest of the entries in the column are set to 0. Note ΠA can be multiplied in time  O(s·nnz(A)), and thus small s is desirable.

It was shown in [MM13, NN13], slightly improving [CW13], that either of the above distributions satisfies the (ε, δ, k,2)-OSE moment property for  m = Ω(k2/(ε2δ)), s= 1, and hence (k, ε, δ)-AMM(though this particular conclusion follows easily from [KN14, Theorem 6.2]). It was also shown in [Coh16], improving upon [NN13], that they satisfy the (ε, δ, k, log(k/δ))-OSE moment property, and hence also (k, ε, δ)-AMM, for m = Ω(Bk log(k/δ)/ε2), s = Ω(logB(k/δ)/ε) for any B > 2. It isconjectured that for  B = O(1), m = Ω((k + log(1/δ))/ε2) should suffice [NN13, Conjecture 14].

Remark 2. The work [Coh16] does not explicitly discuss the OSE moment property for sparse subspace embeddings. Rather, [Coh16] bounds  E eE2ℓ/ε = O(k) for E = ∥(ΠU)T (ΠU) − I∥ andℓ = log(k/δ). Note though for  x ≥0 and integer  ℓ ≥ 1, xℓ ≤ ℓ! · ex ≤ ℓℓ · ex by Taylor expansion of the exponential. Setting  x = 2ℓE/ε, [Coh16] thus implies  E(2ℓE/ε)ℓ ≤ ℓℓ · E eE2ℓ/ε = O(k). ThusE Eℓ = O(k) · (ε/2)ℓ < δby choice of  ℓ, which is the (ε, δ, k, log(k/δ))-OSE moment property.

Remark 3. Currently there appears to be a tradeoff: one can either use Π such that ΠA can be computed quickly, such as sparse subspace embeddings or the SRHT, but then the number of rows m is at least k log k. Alternatively one could achieve the optimal  m = O(k/ε2) using subgaussian Π, but then multiplying by Π is slower:  O(mnd) time for A ∈ Rn×d.However, settling for a tradeoff is unnecessary. One can actually obtain the “best of both worlds” by composition, i.e. the multiplication Π = Π1 · Π2of two matrices both supporting AMM. Thus Π2could be a fast matrix providing AMM to low (but suboptimal) dimension, and Π1a “slow” (e.g. subgaussian) matrix with the optimal  O(k/ε2) number of rows. In fact one can even set Π = Π1Π2Π3 whereΠ3is the sparse subspace embedding with  O(k2/ε2) rows and s = 1, Π2is the SRHT, and Π1 isa subgaussian matrix. Then ΠA will have the desired  O(k/ε2) rows and can be computed in time O(nnz(A)) + ˜O(ε−O(1)(k3 + k2d)); see Section A.3 for justification.

2.2 Characterization for (k, ε, )-AMM via deterministic events

Here we provide a different characterization for achieving (k, ε)-AMM. Without loss of generality we assume max{∥A∥2, ∥A∥2F /k} = max{∥B∥2, ∥B∥2F /k} = 1 (so ∥A∥2, ∥B∥2 ≤ 1 and ∥A∥2F , ∥B∥2F ≤ k).Let w, w′ each be minimal such that  ∥A ¯w∥, ∥B ¯w′∥ ≤ ε/C′for some sufficiently large constant C′ (which will be set in the proof of Theorem 2). It was shown that  w, w′ = O(k/ε2) in the proof of Theorem 3.2 (i.b) in [MZ11]. Write the SVDs  Aw = UAwΣAwV TAw, Bw′ = UBw′ ΣBw′V TBw′ .

For 0  ≤ i ≤ log2(1/ε2) define D′i as set of all columns of  UAw, UBw′ whose corresponding squared singular values (from ΣAw, ΣBw′) are at least 1/2i. Let DAwbe the set of min{k, w} largest singular vectors from  UAw, and define  DBw′similarly. Define  Di = D′i ∪ DAw ∪ DBw′ . Let sidenote the dimension of span(Di), and note the  siare non-decreasing.

Let ˜si be siafter rounding up to the nearest power of 2. Group all i with the same value of ˜siinto groups  G1, G2, . . . , Glog2(1/ε2). For example if for  i = 0, 1, 2, 3 the si are 3, 4, 15,16 then the ˜siare 4, 4, 16, 16 and G1 = {0, 1}, G2 = {2, 3}. Let vjbe the common value of ˜si for i in Gj.

image

Proof. Define s = |DAw ∪ DBw′ | ≤ 2k and let s′i denote the dimension of span(D′i). Then the above summation is at most �i(s/2i + s′i/2i) ≤ 4k + �i s′i/2i. It thus suffices to bound the second summand by 4k.

Note that we can find a basis for  D′i among the columns of  UAw, UBw′ with corresponding squared singular value at least 1/2i, so let ai +bi = s′i, where aiis the number of columns of  UAw inthe basis and  bithe number of columns of  UBw′in the basis. Then by averaging, if the inequality of the lemma statement does not hold then either �i ai/2i > 2k or �i bi/2i > 2k. Without loss of generality assume the former.

Consider an arbitrary column of  UAw, and suppose it has squared singular value in the range [1/2i, 1/2i−1).Then it is in span(D′j) for all j ≥ i.Its contribution to �i ai/2iis therefore 1/2i + 1/2i+1 + . . .which is at most 2/2i = 1/2i−1. It follows that �i ai/2i ≤ 2k, since the squared Frobenius norm of  Awis at most k. This is a contradiction to �i ai/2i > 2k. ■Now we prove the main theorem of this subsection.

image

(1) If  w + w′ ≤ k, then Π is an ε/C-subspace embedding for the subspace spanned by the columns of  Aw, Bw′. Otherwise if  w + w′ > k, then for each 0  ≤ i ≤ log2(1/ε2), Π is an εi/C-subspaceembedding for span(Di′) with

image

Proof. We would like to bound

image

(2). Furthermore by the definition of  w, w′ we know ∥A ¯w∥, ∥B ¯w′∥ ≤ ε/C′, and thus ζ + η +Θ  ≤ 2ε/C′ + (ε/C′)2. Note condition (1) implies that Π is a (1/2)-subspace embedding for the subspace spanned by columns of  Aw, Bw′(by taking i maximal). Thus by both conditions we have β, γ ≤ (ε/C)(1 + 1/2).

It only remains to bound  α. If w + w′ ≤ k, then we are done by condition (1) and Lemma 1. Thus assume  w + w′ > k. Then we have

image

Let x, y be any unit norm vectors. Write  x = x1 + x2 + . . . + xb for b = log2(1/ε2), wherexi is the restriction of x to coordinates for which the corresponding squared singular values in ΣAw are in (1/2i, 1/2i−1].Similarly define  y1, . . . , yb. Then |�ΠUAwΣAwx, ΠUBw′ ΣBw′ y�−

image

image

We bound the first sum, as bounding the second is similar. Note  UAwΣAwxi, UBw′ ΣBw′�j≤i yj ∈Di. Therefore by property (1) and Lemma 1,

image

where Eq. (8) used that the corresponding v value in property (1) is at most 2si. Returning to Eq. (7) and applying Cauchy-Schwarz and Lemma 3,

image

We thus finally have that Eq. (6) is at most (2√8 + 3)ε/C + +(ε/C)2 + 2ε/C′ + (ε/C′)2, whichis at most  ε for C, C′ sufficiently large constants. ■

Now we discuss some implications of Theorem 2 for specific Π.

Example 1: Let Π have O(k/ε2) rows forming an orthonormal basis for the span of the columns of  Aw, Bw′. Property (1) is satisfied for every i in fact with  εi= 0. Property (2) is also satisfied since  ∥ΠA ¯w∥ ≤ ∥Π∥ · ∥A ¯w∥ ≤ ε, and similarly for bounding  ∥ΠB ¯w′∥.

Example 2: Let Π be a random  m×nmatrix with independent entries that are subgaussian with variance 1/m. For example, the entries of Π may be N(0, 1/m), or uniform in  {−1/√m, 1/√m}. Letm be Θ((k + log(1/δ))/ε2). As mentioned in Section A.1, such Π is an  ε-subspace embedding for a k-dimensional subspace with failure probability  δ. For property (1) of Theorem 2, if  w+w′ ≤ k thenwe would like Π to be an  ε-subspace embedding for a subspace of dimension at most k, which holds with failure probability  δ. If w + w′ > kthen we would like Π to be an  εi-subspace embedding for span(Di′) for all 1  ≤ i ≤ log2(1/ε2) simultaneously. Note maxj vj ≤ 2(w +w′) = O(k/ε2), and thus maxj vj ≤ m. Thus for a subspace under consideration Di′ for i′ ∈ Gj, we have failure probability δvj/k for our choice of m. By construction every  vjis at least  k, and the vjincrease at least geometrically. Thus our total failure probability is, by a union bound, �j δvj/k ≤ �j δ2j−1 = O(δ).Property (2) of Theorem 2 is satisfied with failure probability  δby [RV13, Theorem 3.2].

Spectral norm approximate matrix multiplication with dimension bounds depending on stable rank has immediate applications for the analysis of generalized regression and low-rank approximation problems. We also point out to the reader recent applications of this result to kernelized ridge regression [YPW15] and k-means clustering [CEM+15].

3.1 Generalized regression

Here we consider generalized regression: attempting to approximate a matrix B as AX, with A of rank at most  k. Let PAbe the orthogonal projection operator to the column space of A, with P ¯A = I − P; then the natural best approximation will satisfy

image

This minimizes both the Frobenius and spectral norms of  AX − B. A standard approximation algorithm for this is to replace A and B with sketches ΠA and ΠB, then solve the reduced problem exactly (see e.g. [CW09], Theorem 3.1). This will produce

image

Below we give a lemma on the guarantees of the sketched solution in terms of properties of Π.

image

1. satisfies the (k,�ε/8)-approximate spectral norm matrix multiplication property for  UA, P ¯AB

2. is a (1/2)-subspace embedding for the column space of A (which is implied by Π satisfying the spectral norm approximate matrix multiplication property for  UAwith itself)

image

Proof. We may write:

image

So far, we have shown that the error depends only on  P ¯AB and not PAB(with the third line following from the fact that the sketched regression is exact on  PAB). Now, in the last line, we can see that the two terms lie in orthogonal column spaces (the first in the span of A, the second orthogonal to it). For matrices X and Y with orthogonal column spans,  ∥X +Y ∥2 ≤ ∥X∥2 +∥Y ∥2,so this is at most

image

Spectral submultiplicativity then implies the first term is at most

image

∥UA∥is 1, since  UAis orthonormal. ((ΠUA)T ΠUA)−1 is at most 2, since Π is a subspace embedding for  UA. Finally, ∥(ΠUA)T ΠP ¯AB∥is at most

image

3.2 Low-rank approximation

Now we apply the generalized regression result from Section 3.1 to obtain a result on low-rank approximation: approximating a matrix A in the form ˜Uk ˜Σk ˜V Tk , where ˜Uk has only k columnsand both ˜Uk and ˜Vkhave orthonormal columns. Here, we consider a previous approach (see e.g. [Sar06]):

1. Let  S = ΠA.

2. Let  PSbe the orthogonal projection operator to the row space of  S. Let ˜A = APS.

3. Compute a singular value decomposition of ˜A, and keep only the top k singular vectors. Return the resulting low rank approximation ˜Ak of ˜A.

It turns out computing ˜Akcan be done much more quickly than computing  Ak; see details in [CW09, Lemma 4.3].

Let  Akbe the exact k-truncated SVD approximation of A (and thus the best rank-k approximation, in the spectral and Frobenius norms), and let  Ukbe the top k column singular vectors, and  A¯k = A − Akbe the tail.

image

1. satisfies the (k,�ε/8)-approximate spectral norm matrix multiplication property for  Uk, A¯k

2. is a (1/2)-subspace embedding for the column space of  Uk

image

Proof. Note that this procedure chooses the best possible (in the spectral norm) rank-k approximation to A subject to the constraint of lying in the row space of S. Thus, the spectral norm error can be no worse than the error of a specific such matrix we exhibit.

image

onto  Uk, with Π:

image

This is rank-k by construction, since it is multiplied by  Uk, and it lies in the row space of  S = ΠAsince that is the rightmost factor. On the other hand, it is an application of the regression algorithm to A where the optimum output is  Ak(since that is the projection of A onto the space of  Uk).Plugging this into Eq. (9) gives the desired result. ■

3.3 Kernelized ridge regression

In nonparametric regression one is given data  yi = f ∗(xi) + wi for i = 1, . . . , n, and the goal is to recover a good estimate for the function  f ∗. Here the yiare scalars, the  xiare vectors, and the  wiare independent noise, often assumed to be distributed as mean-zero gaussian with some variance σ2. Unlike linear regression where  f ∗(xi) is assumed to take the form  ⟨β, x⟩for some vector  β,in nonparametric regression we allow  f ∗ to be an arbitrary function from some function space. Naturally the goal then is to recover some ˜f from the data so that, as n grows, the probability that ˜f is “close” to  f ∗increases at some good rate.

The recent work [YPW15] considers the well studied problem of obtaining ˜f so that ∥ ˜f − f ∗∥2nis small with high probability over the noise w, where one uses the definition

image

The work [YPW15] considers the case where  f ∗ comes from a Hilbert space H of functions f such that f is guaranteed to be square integrable, and the map  x �→ f(x) is a bounded linear functional. The function ˜f is then defined to be the optimal solution to the Kernel Ridge Regression (KRR) problem of computing

image

for some parameter  λn. It is known that any H as above can be written as the closure of the set of all functions

image

over all  α ∈ RN and vectors  z1, . . . , zNfor some positive semidefinite kernel function k. Furthermore, the optimal solution to Eq. (11) can be expressed as  f LS = �ni=1 αLSi ·k(·, xi) for some choice of weight vector  αLS, and it is known that  ∥f LS − f ∗∥nwill be small with high probability, over the randomness in  w, if λnis chosen appropriately (see [YPW15] for background references and precise statements).

After rewriting Eq. (11) using Eq. (12) and defining a matrix  K with Ki,j = k(xi, xj), onearrives at a reformulation for KRR of computing

image

which can be computed in  O(n3) time. The work [YPW15] then focuses on speeding this up, by instead computing a solution to the lower-dimensional problem

image

and then returning as ˜f the function specified by the weight vector ˜α = ΠT ˜αLS. Note that once various matrix products are formed (where the running time complexity depends on the Π being used), one only needs to invert an  m × mmatrix thus taking  O(m3) time. They then prove that ∥ ˜f − f ∗∥nis small with high probability as long as Π satisfies two deterministic conditions (see the proof of Lemma 2 [YPW15, Section 4.1.2], specifically equation (26) in that work):

Π is a (1/2)-subspace embedding for a particular low-dimensional subspace

• ∥ΠB∥ = O(∥B∥) for a particular matrix B of low stable rank (B is UD2in [YPW15]). Note

image

and thus it suffices for Π to provide the approximate matrix multiplication property for the product  BT B, where Bhas low stable rank.

The first bullet simply requires a subspace embedding in the standard sense, and for the second bullet [YPW15] avoided AMM by obtaining a bound on  ∥ΠB∥directly by their own analyses for gaussian Π and the SRHT (in the gaussian case, it also follows from [RV13, Theorem 3.2]). Our result thus provides a unifying analysis which works for a larger and general class of Π, including for example sparse subspace embeddings.

3.4 k-means clustering

In the works [BZMD15, CEM+15], the authors considered dimensionality reduction methods for k-means clustering. Recall in k-means clustering one is given  n points x1, . . . , xn ∈ Rd, as well as an integer  k ≥1, and the goal is to find  k points y1, . . . , yk ∈ Rd minimizing

image

That is, the n points can be partitioned arbitrarily into k clusters, then a “cluster center” should be assigned to each cluster so as to minimize sums of squared Euclidean distances of each of the n points to their cluster centers. It is a standard fact that once a partition  P = {P1, . . . , Pk} of then points into clusters is fixed, the optimal cluster centers to choose are the centroids of the points in each of the k partitions, i.e.  yj = (1/|Pj|) · �i∈Pj xi.

One key observation common to both of the works [BZMD15, CEM+15] is that k-means clustering is closely related to the problem of low-rank approximation. More specifically, given a partition P = {P1, . . . , Pk}, define the  n × k matrix XP by

image

Let  A ∈ Rn×d have rows x1, . . . , xn. Then the k-means problem can be rewritten as computing

image

where P ranges over all partitions of {1, . . . , n} into k sets. It is easy to verify that the non-zero columns of  XPare orthonormal, so  XPXTP is the orthogonal projection onto the column space of  XP. Thus if one defines S as the set of all rank at most k orthogonal projections obtained as XPXTP for some k-partition P, then the above can be rewritten as the constrained rank-k projection problem of computing

image

One can verify this by hand, since the rows of A are the points  xi, and the ith row of PA forP = XPXTP is the centroid of the points in i’s partition in P.

rank-k projections) and Π ∈ Rm×d satisfies certain technical conditions to be divulged soon, then if ˜P ∈ S satisfies

image

then

image

the best rank-k approximation to  A and let A¯k = A − Ak. Define Z ∈ Rd×r for r = 2k byZ = Vr, i.e. the top r right singular vectors of A are the columns of  Z. Define B1 = ZT and

√k∥A¯k∥F · (A − AZZT). Define B ∈ R(n+r)×d as having B1as its first  r rows and B2as its lowern rows. Then [CEM+15, Lemma 10] states that Eq. (14) implies Eq. (15) as long as

image

One can easily check  ∥B∥2 = 1 and ∥B∥2F ≤ 3k, so the stable rank ˜r(B) is at most 3k. ThusEq. (16) is implied by the (3k, ε/2)-AMM property for  BT, BT , and our results apply to show that Π can be taken to have  m = O((k+log(1/δ))/ε2) rows to have success probability 1−δfor Eq. (16). Obtaining Eq. (17) is much simpler and can be derived from the JL moment property (see the proof of [KN14, Theorem 6.2]).

Without our results on stable-rank AMM provided in this current work, [CEM+15] gave adifferent analysis, avoiding [CEM+15, Lemma 10], which required Π to have  m = Θ(k·log(1/δ)/ε2)rows (note the product between  k and log(1/δ) instead of the sum).

As well as random projections, approximate matrix multiplication (and subspace embeddings) by row selection are also common in algorithms. This corresponds to setting Π to a diagonal matrix S with relatively few nonzero entries. Unlike random projections, there are no oblivious distributions of such matrices S with universal guarantees. Instead, S must be determined (either randomly or deterministically) from the matrices being embedded.

There are two particularly algorithmically useful methods for obtaining such S. The first is importance sampling: independent random sampling of the rows, but with nonuniform sampling probabilities. This is analyzed using matrix Chernoff bounds [AW02], and for the case of k-dimensional subspace embedding or approximate matrix multiplication of rank-k matrices, it can produce  O(k(log k)/ε2) samples [SS11]. The second method is the deterministic selection method given in [BSS12], often called “BSS”, choosing only  O(k/ε2) rows. This still runs in polynomial time, but originally required many relatively expensive linear algebra steps and thus was slower in general; see [LS15] for runtime improvements.

The matrix Chernoff methods can be extended to the stable-rank case, making even the log factor depend only on the stable rank, using “intrinsic dimension” variants of the bounds as presented in Chapter 7 of [Tro15]. Specifically, Theorem 6.3.1 of that work can be applied with each n summands each equal to 1n�1pi aTi bi − AT B�, where ai is the ith row of A, and iis random with the probability of choosing a particular row i equal to

image

We here give an extension of BSS that covers low stable rank matrices as well.

image

When  AT Ais the identity, this is just the original BSS result. It is also stronger than Theorem 3.3 of [KMST10], implying it when A is the combination of the rows�N/T · vifrom that theorem statement with an extra column containing the costs, and a constant  ǫ. The techniques in that paper, on the other hand, can prove a result comparable to Theorem 5, but with the row count scaling as  k/ε3 rather than  k/ε2.

Proof. The proof closely follows the original proof of BSS. However, for simplicity, and because the tight constants are not needed for most applications, we do not include [BSS12, Claim 3.6] and careful parameter-setting.

At each step, the algorithm will maintain a partial approximation  Z = (SA)T (SA) (the matrix “A” in [BSS12]), with S beginning as 0. Additionally, we keep track of upper and lower “walls”  Xuand  Xl; in the original BSS these are just multiples of the identity. The final S will be returned by the algorithm (rescaled by a constant so that the average of the upper and lower walls is  AT A).

We will maintain the invariants

image

These are the so-called upper and lower potentials from BSS. We also require  Xu ≺ Z ≺ Xl; recallM ≺ M′ means that M′ − Mis positive definite. Note that unlike [BSS12], here we do not apply a change of variables making  AT Athe identity (to avoid confusion, since that would change the Frobenius norm). This is the reason for the slightly more complicated form of the potentials.

In the original BSS,  Xu and Xlwere always scalar multiples of the identity (here, without the change of variables, that would correspond to always being multiples of  AT A). [BSS12] thus simply represented them with scalars. Like BSS, we will increase  Xu and Xlby multiples of  AT A–however,the key difference from BSS is that they are initialized to multiples of the identity, rather than AT A. In particular, we may initialize  Xu to kI and Xl to −kI. This is still good enough to get the spectral norm bounds we require here (as opposed to the stronger multiplicative approximation guaranteed by BSS).

We will have two scalar values,  δu and δl, depending only on  ε; they will be set later. One step consists of

1. Choose a row  ai from Aand a positive scalar  t, and add taiaTi to Z(via increasing the i component of S).

2. Add  δuAT A to Xu and δlAT A to Xl.

We will show that with suitable values of  δu and δl, for any Zobeying the invariants there always exists a choice of i and t such that the invariants will still be true after the step is complete. This corresponds to Lemmas 3.3 through 3.5 of BSS.

For convenience, we define, at a given step, the matrix functions of y

image

The upper barrier value, after making a step of  taiaTi and increasing  Xu, is

image

Applying the Sherman-Morrison formula, and cyclicity of trace, to the rank-1 update  taiaTi , thiscan be rewritten as

image

Since the function  f(y) = tr(AMu(y)AT ) is a convex function of y with derivative

image

we have  f(δu) − f(0) ≤ −δu tr(AMu(δu)AT AMu(δu)AT ). Then the difference between the barrier before and after the step is at most

image

Constraining this to be no greater than zero, rewriting in terms of 1t and pulling it out gives

image

Furthermore, as long as 1t is at least this, Z will remain below  Xu, since the barrier must approach infinity as t approaches the smallest value passing  Xu.For the lower barrier value after the step, we get

image

Again, applying Sherman-Morrison rewrites it as

image

Again, due to convexity the increase in the barrier from raising  Xlis at most  δltimes the local derivative. The difference in the barrier after the step is then at most

image

This is not greater than zero as long as

image

the upper bound. To show that there is at least one choice of i for which this holds, we look at the sum of all the lower bounds and compare to the sum of all the upper bounds. Summing the former over all i gets

image

Finally, note that

image

and the lower barrier implies  Z − Xl ≻ AT A, implying that as long as  δl ≤ 12,

image

Thus, we can always make a step as long as  δu and δlare set so that

image

Before the first step,  Xu and Xlcan be initialized as  kI and −kI, respectively. If the algorithm is then run for kε2steps, we have:

image

kS) satisfies the requirements of the output for 3ε(one can simply apply this argument for  ε/3). Furthermore, all the computations required to verify the preservation of invariants and compute explicit ts can be performed in polynomial time. ■

This obtains more general AMM as a corollary:

Corollary 1. Given two matrices A and B, each with n rows, and an  ε ∈ (0, 1), there exists a diagonal matrix  S with O(k/ε2)nonzero entries satisfying the (k, ε)-AMM property for A, B. Such an S can be computed by a polynomial-time algorithm.

Proof. Apply Theorem 5 to a matrix X consisting of the columns of A√2 max(∥A∥2,∥A∥F /√k) appended

image

Note that X satisfies the conditions of that theorem, since concatenating the sets of columns at most adds the squares of their spectral and Frobenius norms. (SA)T (SB) − AT Bis a submatrix of 2 max(∥A∥2, ∥A∥F /√k) max(∥B∥2, ∥B∥F /√k)((SX)T (SX) − XT X), so its spectral norm is upper bounded by the spectral norm of that matrix, which in turn is bounded by the guarantee of Theorem 5. ■

We thank Jaros�law B�lasiok for pointing out the connection between low stable rank approximate matrix multiplication and the analyses in [YPW15].

[AC09] Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.

[AL13] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013.

[AW02] Rudolf Ahlswede and Andreas J. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

[Bou14] Jean Bourgain. An improved estimate in the restricted isometry problem. Geometric Aspects of Functional Analysis, 2116:65–70, 2014.

[BSS12] Joshua D. Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM J. Comput., 41(6):1704–1721, 2012.

[BZMD15] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2015.

[CCF04] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

[CEM+15]Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and M˘ad˘alina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th ACM Symposium on Theory of Computing (STOC), 2015. Full version at http://arxiv.org/abs/1410.6801v3.

[CLL+10]Pei-Chun Chen, Kuang-Yao Lee, Tsung-Ju Lee, Yuh-Jye Lee, and Su-Yun Huang. Multiclass support vector classification via coding and regression. Neurocomputing, 73(7-9):1501–1512, 2010.

[CLM+15]Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 6th Annual Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.

[Coh16] Michael B. Cohen. Simpler and tighter analysis of sparse oblivious subspace embeddings. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), to appear, 2016.

[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 205–214, 2009. Full version at http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/cw09.pdf.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 81–90, 2013. Full version at http://arxiv.org/abs/1207.6365v4.

[DKM06] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J. Comput., 36(1):132–157, 2006.

[DKS10] Anirban Dasgupta, Ravi Kumar, and Tam´as Sarl´os. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.

[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006.

[DMMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.

[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkh¨auser, 2013.

[GM13] Alex Gittens and Michael W. Mahoney. Revisiting the nystrom method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 567–575, 2013.

[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[HR16] Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier matrices. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), to appear, 2016.

[HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42(3):1079–1083, 1971.

[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[KMN11] Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit johnson-lindenstrauss families. In Proceedings of the 14th International Workshop on Randomization and Computation (RANDOM), pages 628–639, 2011.

[KMST10] Alexandra Kolla, Yury Makarychev, Amin Saberi, and Shang-Hua Teng. Subgraph sparsification and nearly optimal ultrasparsifiers. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC ’10, pages 57–66, New York, NY, USA, 2010. ACM.

[KN14] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4, 2014.

[KVZ14] Anastasios T. Kyrillidis, Michail Vlachos, and Anastasios Zouzias. Approximate matrix multiplication with application to linear embeddings. CoRR, abs/1403.7683, 2014.

[KW11] Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property. SIAM J. Math. Anal., 43(3):1269–1281, 2011.

[LBKW14] Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. Woodruff. Improved distributed principal component analysis. In Proceedings of the 27th Annual Conference on Advances in Neural Information Processing Systems (NIPS), pages 3113–3121, 2014.

[LDFU13] Yichao Lu, Paramveer Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Proceedings of the 26th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2013.

[LMP13] Mu Li, Gary L. Miller, and Richard Peng. Iterative row sampling. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 127–136, 2013.

[LP86] Fran¸cois Lust-Piquard. In´egalit´es de Khintchine dans  Cp (1 < p < ∞). C. R. Math.Acad. Sci. Paris, 303(7):289–292, 1986.

[LPP91] Fran¸cois Lust-Piquard and Gilles Pisier. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

[LS15] Yin Tat Lee and He Sun. Constructing linear sized spectral sparsification in almost linear time. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 250–269, 2015.

[LWM+07]Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.

[Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[MM13] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 91–100, 2013.

[MZ11] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1422–1436, 2011.

[NN13] Jelani Nelson and Huy L. Nguy˜ˆen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 117–126, 2013.

[NN14] Jelani Nelson and Huy L. Nguy˜ˆen. Lower bounds for oblivious subspace embeddings. In Proceedings of the 41st International Colloquium on Automata, Languages, and Programming (ICALP), pages 883–894, 2014.

[NPW14] Jelani Nelson, Eric Price, and Mary Wootters. New constructions of RIP matrices with fast multiplication and fewer rows. In Proceedings of the 25th Annual ACMSIAM Symposium on Discrete Algorithms (SODA), 2014.

[RHV11] Nima Reyhani, Hideitsu Hino, and Ricardo Vig´ario. New probabilistic bounds on eigenvalues and eigenvectors of random kernel matrices. In Proceedings of the TwentySeventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 627–634, 2011.

[RV13] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18:1–9, 2013.

[Sar06] Tam´as Sarl´os. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006.

[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.

[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 3(1–2):115–126, 2011.

[Tro15] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[TZ12] Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293– 331, 2012.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[YPW15] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. CoRR, abs/1501.06195, 2015.

In the following two subsections we show the OSE moment property for both subgaussian matrices and the SRHT.

image

In this section, we show the OSE moment property for distributions satisfying a JL condition, namely the JL moment property. This includes matrices with i.i.d. entries that are mean zero and subgaussian with variance 1/m.

Definition 4. [KMN11] Let D be a distribution over  Rm×n. We say D has the (ε, δ, p)-JL momentproperty if for all  x ∈ Rn of unit norm,

image

The following theorem follows from the proof of Lemma 8 in the full version of [CW13]. We give a different proof here inspired by the proof of [FR13, Theorem 9.9], which is slightly shorter and more self-contained. A weaker version appears in [Sar06, Lemma 10], where the size bound on X is (Cd/ε)d for a constant  C ≥1 instead of simply  Cd.

Theorem 6. Let U ∈ Rn×d with orthonormal columns be arbitrary. Then there exists a set  X ⊂ Rn,|X| ≤ 9d, each of norm at most 1 such that

image

Proof. We will show that if supx∈X |∥Πx∥2 − 1| < ε/2 then ∥(ΠU)T (ΠU) − I∥ < ε, where ε > 0is some positive real. Define  A = (ΠU)T (ΠU) − I. Since Ais symmetric,

image

Let  Tγbe a finite  γ-net of ℓd2, i.e. Tγ ⊂ ℓd2 and for every  x ∈ Rdof unit norm there exists a  y ∈ Tγsuch that  ∥x − y∥2 ≤ γ. As we will see soon, there exists such a  Tγof size at most (1 + 2/γ)d. Wewill show that if Π satisfies the JL condition on  T ′ = {Uy : y ∈ T1/4}with error  ε/2, then ∥A∥ < ε;that is, (1  − ε/2)∥x∥22 ≤ ∥Πx∥22 ≤ (1 + ε/2)∥x∥22 for all x ∈ T ′.

Let x be a unit norm vector that achieves the sup above, i.e.  ∥A∥ = | ⟨Ax, x⟩ |. Then, letting y be the closest element of  Tγ to x,

image

Rearranging gives  ∥A∥ ≤ ε/(2(1 − 2γ)), which is  ε for γ = 1/4.

Now we must show that we can take  |Tγ| ≤ (1 + 2/γ)d. The following is a standard covering/packing argument for bounding metric entropy. Imagine packing as many radius-(γ/2) ℓ2 ballsas possible into  Rd, centered at points with at most unit norm and such that these balls do not intersect each other. Then these balls all fit into a radius-(1 +  γ/2) ℓ2ball centered at the origin, and thus the number of balls we have packed is at most the ratio of the volume of a (1 +  γ/2) ballto the volume of a  γ/2 ball, which is ((1 +  γ/2)/(γ/2))d = (1 + 2/γ)d. Now, take those maximally packed radius-(γ/2) balls and double each of their radii to be radius  γ. Then every point in the unit ball is contained in at least one of these balls by the triangle inequality, which is exactly the property we wanted from  Tγ (Tγis just the centers of these balls). To see why every point is in at least one such ball, if some  x ∈ Rd of unit norm is not contained in any doubled ball then a γ/2-ball about x would be disjoint from our maximally packed  γ/2 balls, a contradiction. ■

Lemma 4. If D satisfies the (ε, δ, p)-JL moment property, then D satisfies the (2ε, 9dδ, d, p)-OSEmoment property

Proof. By Theorem 6, there exists a subset  X ⊂ Rn of at most 9d points such that

image

It is known that if D is a distribution over  Rm×n with m = Ω(log(1/δ)/ε2) and for Π  ∼ D,the entries of Π are independent subgaussians with mean zero and variance 1/m, then D has the (ε/2, δ, Θ(log(1/δ)))-JL moment property [KMN11]. Thus such a matrix has the (ε, δ, d, Θ(d +log(1/δ)))-OSE moment property for  δ < 2−d by Lemma 4.

image

Recall the SRHT is the  m × nmatrix Π = (1/√m) · SHD for na power of 2 where D has diagonal entries  α1, . . . , αnthat are independent and uniform in  {−1, 1}, His the unnormalized Hadamard transform with  Hi,j = (−1)⟨i,j⟩ (treating i, jas elements of the vector space  Flog2 n2 ), and S is asampling matrix. That is, the rows of S are independent, and each row has a 1 in a uniformly random location and zeroes elsewhere. A similar construction is where  S is an n × n diagonalmatrix with  Si,i = ηibeing independent Bernoulli random variables each of expectation m/n (so that, in expectation, S selects m rows from HD). We will here show the moment property for this latter variant since it makes the notation a tad cleaner, though the analysis we present holds essentially unmodified for the former variant as well.

Our analysis below implies that the SRHT provides an  ε-subspace embedding for d-dimensional subspaces with failure probability  δ for m = O(ε−2(d + log(1/(εδ))) log(d/δ)). This is an improvement over analyses we have found in previous works. The analysis in [Tro11] only considers constant ε and δ = O(1/d) and for these settings achieves m = O((d+log n) log d), which is still slightly worse than our bound for this setting of  ε, δ(our bound removes the log n and achieves any 1/ poly(d) failure probability with the same m). The analysis in [LDFU13] only allows failure probabilities greater than  n/ed. They show failure probability  δ + n/ed is achieved for  m = O(d log(d/δ)/ε2),which is also implied by our result if  m ≤ n(which is certainly the case in applications for the SRHT to be useful, since otherwise one could use the  n × nidentity matrix as a subspace embedding). The reason for these differences is that previous works operate by showing HDU has small row norms with high probability over D; since there are n rows, some logarithmic dependence on n shows up in a union bound. After this conditioning, one then shows that S works. Our analysis does not do any such conditioning at all. Interestingly, such a lossy conditioning approach was done even for the case d = 1 [AC09]. As we see below, these analyses can be improved (essentially the log n terms that appear from the conditioning approach can be very slightly improved to log m).

Our main motivation in re-analyzing the SRHT was not to improve the bounds, but simply to clearly demonstrate that the SRHT satisfies the OSE moment property. The fact that our moment based analysis below (very slightly) improved m was a fortunate accident. Before we present our proof of the OSE moment property for the SRHT, we state a theorem we will use. For a random matrix M, we henceforth use  ∥M∥pto denote (E ∥M∥pSp)1/p where ∥M∥Sp is the Schatten-p norm, i.e. the  ℓpnorm of the singular values of M.

Theorem 7 (Non-commutative Khintchine inequality [LP86, LPP91]). Let X1, . . . , Xnbe fixed real matrices and  σ1, . . . , σnbe independent Rademachers. Then

image

We will also make use of the Hanson-Wright inequality.

image

We now present our main analysis of this subsection.

Theorem 9. The SRHT satisfies the (ε, δ, d, p)-moment property for  p = log(d/δ)as long as m ≳ ε−2(d log(d/δ) + log(d/δ) log(m/δ)) ≃ ε−2(d + log(1/(εδ)) log(d/δ)).

Proof. For a fixed  U ∈ Rn×d with orthonormal columns, we would like to bound

image

Since  p ≥ log d we have

image

by H¨older’s inequality. Also, let  z1, . . . , znbe the rows of HDU, as column vectors, so that

image

Note also �i zizTi = (HDU)T HDU = n · I for any D, so the identity matrix is the expectation, over  η, of the right hand side of Eq. (21) for any D. Thus we are left wanting to bound

image

where the  η′i are identically distributed as the  ηibut independent of them. Below we use  ∥f(X)∥Lp(X)to denote (EX |f(X)|p)1/p. Also we assume p is an integer multiple of 4, so that  ∥A∥Spfor real symmetric  A equals (tr(Ap))1/p and ∥A∥Sp/2 = (tr(Ap/2))2/p. Thus for (σi) independent Rademachers,

image

image

implying that for some fixed constant C > 0, we have  Q2 − CRQ − CR ≤0. This implies that Q is at most the larger root of the associated quadratic equation, i.e.  Q ≲ max{√R, R}, or equivalently

image

where ˜Uiis the matrix with ( ˜Ui)k,j = Hi,k · Uk,j. Of particular importance for us is the identity ˜U Ti ˜Ui = I. Then by Eq. (26) and Theorem 8,

image

image

so that

image

which when combined with Eq. (25) gives

image

Thus the OSE moment property is satisfied by our choices of m, p in the theorem statement. ■

image

As discussed in Remark 3, to obtain both a good number of rows for Π as well as fast multiplication for ΠA, ΠB, one may wish to set Π as the composition Π = Π1Π2, where Π1has the correct number m1 = O(k/ε2) of rows (e.g. a matrix of subgaussian entries), whereas Π2maps to a small but suboptimal number  m2of rows (e.g. the SRHT) but supports fast embedding to compute Π2A. Weshow here that composing maps each supporting AMM yields a final map also giving AMM.

As discussed in Corollary 1, without loss of generality we can assume A = B. Also, as discussed in Remark 1, we can focus on achieving Eq. (3) where the number of rows of Π should depend on the stable rank ˜r and not rank r of A. The key is to note the following simple triangle inequality:

image

The results of this work show that to achieve the desired  β ≤ ε∥A∥2, it suffices that the number of rows of Π2need only depend on ˜r and not r, as desired. The trouble is that for  α, the number of rows of Π1will need to depend on the stable rank ˜r′ of Π2A and not ˜r. Furthermore, the error will be  α ≤ ε∥Π2A∥2 and not α ≤ ε∥A∥2. Thus, we must obtain good bounds on both ˜r′ and ∥Π2A∥.To achieve this, note

image

and

image

above), then indeed we have  ∥Π2A∥ = Θ(∥A∥) by Eq. (28). Also, [KN14, Theorem 6.2] implies ∥(Π2A)T (Π2A) − AT A∥F ≤ ε∥A∥2F with probability 1  − δas long as Π comes from a distribution satisfying the (O(ε), δ, ℓ)-JL moment property for some  ℓ ≥2 (which is just the (O(ε), δ, 1, ℓ)-OSEmoment property in the terminology of this work). If this holds, then  ∥Π2A∥F = Θ(∥A∥F ) byEq. (29), and thus ˜r′ = Θ(˜r), as desired. Then overall, we have that the left hand side of Eq. (27) is at most  ε · ∥Π2A∥2 + ε · ∥A∥2 = O(ε) · ∥A∥2 as desired, in which both Π1 and Π2need only provide AMM with error  εfor matrices both of stable rank  O(˜r).

Remark 4. An even slicker argument that works in the case when Π1, Π2are both drawn from distributions satisfying the (ε, δ, k, ℓ)-OSE moment property is to observe that the distribution of the product Π1Π2itself satisfies the OSE moment property. Indeed, letting  ∥Z∥p denote (E |Z|p)1/pfor a scalar random variable Z, and letting  U ∈ Rn×k denote a matrix with orthonormal columns,

image

In the first line we used that when A = B in Lemma 2, Π1need only satisfy the OSE moment property with parameter k instead of 2k (since then the span of the columns of both A and B has dimension at most k). Thus the distribution of the product Π1Π2satisfies the (O(ε), O(δ), k, ℓ)-OSEmoment property.


Designed for Accessibility and to further Open Science