b

DiscoverSearch
About
My stuff
Understanding Sparse JL for Feature Hashing
2019·arXiv
Abstract
Abstract

Feature hashing and other random projection schemes are commonly used to reduce the dimensionality of feature vectors. The goal is to efficiently project a high-dimensional feature vector living in  Rn into amuch lower-dimensional space  Rm, while approximately preserving Euclidean norm. These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et. al (ICML ’09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small  ℓ∞-to-ℓ2norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS ’18) closed this line of work by proving a tight tradeoff between  ℓ∞-to-ℓ2norm ratio and accuracy for sparse JL with sparsity 1.

In this paper, we demonstrate the benefits of using sparsity s greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between  ℓ∞-to-ℓ2norm ratio and accuracy for a general sparsity s, that significantly generalizes the result of Freksen et. al. Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1; we also empirically demonstrate this finding.

Feature hashing and other random projection schemes are influential in helping manage large data [11]. The goal is to reduce the dimensionality of feature vectors: more specifically, to project high-dimensional feature vectors living in  Rninto a lower dimensional space  Rm(where  m ≪ n), while approximately preserving Euclidean distances (i.e.  ℓ2distances) with high probability. This dimensionality reduction enables a classifier to process vectors in  Rm, instead of vectors in  Rn. In this context, feature hashing was first introduced by Weinberger et. al [29] for document-based classification tasks such as email spam filtering. For such tasks, feature hashing yields a lower dimensional embedding of a high-dimensional feature vector derived from a bag-of-words model. Since then, feature hashing has become a mainstream approach [28], applied to numerous domains including ranking text documents [4], compressing neural networks [7], and protein sequence classification [5].

Random Projections

Dimensionality reduction schemes for feature vectors fit nicely into the random projection literature. In fact, the feature hashing scheme proposed by Weinberger et al. [29] boils down to uniformly drawing a random m × nmatrix where each column contains one nonzero entry, equal to  −1or 1.

The  ℓ2-norm-preserving objective can be expressed mathematically as follows: for error  ϵ > 0and failure probability  δ, the goal is to construct a probability distribution A over  m × nreal matrices that satisfies the following condition for vectors  x ∈ Rn:

image

The result underlying the random projection literature is the Johnson-Lindenstrauss lemma, which gives an upper bound on the dimension m achievable by a probability distribution A satisfying (1):

Lemma 1.1 (Johnson-Lindenstrauss [16]) For any  n ∈ Nand  ϵ, δ ∈ (0, 1), there exists a probability distribution A over  m × nmatrices, with  m = Θ(ϵ−2 ln(1/δ)), that satisfies (1).

The optimality of the dimension m achieved by Lemma 1.1 has been proven [17, 15].

To speed up projection time, it is useful to consider probability distributions over sparse matrices (i.e. matrices with a small number of nonzero entries per column). More specifically, for matrices with s nonzero entries per column, the projection time for a vector x goes down from  O(m ∥x∥0)to  O(s ∥x∥0), where  ∥x∥0is the number of nonzero entries of x. In this context, Kane and Nelson [19] constructed sparse JL distributions (which we define formally in Section 1.1), improving upon previous work [2, 22, 12]. Roughly speaking, a sparse JL distribution, as constructed in [19], boils down to drawing a random  m × nmatrix where each column contains exactly s nonzero entries, each equal to  −1/√sor  1/√s. Kane and Nelson show that sparse JL distributions achieve the same (optimal) dimension as Lemma 1.1, while also satisfying a sparsity property.

Theorem 1.2 (Sparse JL [19]) For any  n ∈ Nand  ϵ, δ ∈ (0, 1), a sparse JL distribution  As,m,n(defined formally in Section 1.1) over  m×nmatrices, with dimension  m = Θ(ϵ−2 ln(1/δ))and sparsity  s = Θ(ϵ−1 ln(1/δ)), satisfies (1).

Sparse JL distributions are state-of-the-art sparse random projections, and achieve a sparsity that is nearly optimal when the dimension m is  Θ(ϵ−2 ln(1/δ)).1However, in practice, it can be necessary to utilize a lower sparsity s, since the projection time is linear in s. Resolving this issue, Cohen [8] extended the upper bound in Theorem 1.2 to show that sparse JL distributions can achieve a lower sparsity with an appropriate gain in dimension. He proved the following dimension-sparsity tradeoffs:

Theorem 1.3 (Dimension-Sparsity Tradeoffs [8]) For any  n ∈ Nand  ϵ, δ ∈ (0, 1), a uniform sparse JL distribution  As,m,n(defined formally in Section 1.1), with  s ≤ Θ(ϵ−1 ln(1/δ))and

image

satisfies (1).

Connection to Feature Hashing

Sparse JL distributions have particularly close ties to feature hashing. In particular, the feature hashing scheme proposed by Weinberger et al. [29] can be viewed as a special case of sparse JL, namely with s = 1. Interestingly, in practice, feature hashing can do much better than theoretical results, such as Theorem 1.2 and Theorem 1.3, would indicate [13]. An explanation for this phenomenon is that the highest error terms in sparse JL stem from vectors with mass concentrated on a very small number of entries, while in practice, the mass on feature vectors may be spread out between many coordinates. This motivates studying the tradeoff space for vectors with low  ℓ∞-to-ℓ2ratio.

More formally, take  Svto be�x ∈ Rn | ∥x∥∞∥x∥2 ≤ v�, so that  S1 = Rnand  Sv ⊊ Swfor  0 ≤ v < w ≤ 1. Let  v(m, ϵ, δ, s)be the supremum over all  0 ≤ v ≤ 1such that a sparse JL distribution with sparsity s and dimension m satisfies (1) for each  x ∈ Sv. (That is,  v(m, ϵ, δ, s)is the maximum  v ∈ [0, 1]such that for every x ∈ Rn, if  ∥x∥∞ ≤ v ∥x∥2then (1) holds.2) For s = 1, a line of work [29, 12, 18, 10, 19] improved bounds on v(m, ϵ, δ, 1), and was recently closed by Freksen et al. [13].

Theorem 1.4 ([13]) For any  m ∈ Nand  ϵ, δ ∈ (0, 1), the function  v(m, ϵ, δ, 1)is equal to  f(m, ϵ, ln(1/δ))where:

image

Generalizing to Sparse Random Projections with s > 1

While Theorem 1.4 is restricted to the case of s = 1, dimensionality reduction schemes constructed using sparse random projections with sparsity s > 1 have been used in practice for projecting feature vectors. For example, sparse JL-like methods (with s > 1) have been used to project feature vectors in machine learning domains including visual tracking [27], face recognition [23], and recently in ELM [6]. Now, a variant of sparse JL is included in the Python sklearn library.3

In this context, it is natural to explore how constructions with s > 1 perform on feature vectors, by studying v(m, ϵ, δ, s)for sparse JL with s > 1. In fact, a related question was considered by Weinberger et al. [29] for “multiple hashing,” an alternate distribution over sparse matrices constructed by adding s draws from  A1,m,nand scaling by  1/√s. More specifically, they show that  v(m, ϵ, δ, s) ≥ min(1, √s · v(m, ϵ, δ, 1))for multiple hashing. However, Kane and Nelson [19] later showed that multiple hashing has worse geometry-preserving properties than sparse JL: that is, multiple hashing requires a larger sparsity than sparse JL to satisfy (1).

Characterizing  v(m, ϵ, δ, s)for sparse JL distributions, which are state-of-the-art, remained an open problem. In this work, we settle how  v(m, ϵ, δ, s)behaves for sparse JL with a general sparsity s > 1, giving tight bounds. Our theoretical result shows that sparse JL with s > 1, even if s is a small constant, can achieve significantly better norm-preservation properties for feature vectors than sparse JL with s = 1. Moreover, we empirically demonstrate this finding.

Main Results

We show the following tight bounds on  v(m, ϵ, δ, s)for a general sparsity s:

Theorem 1.5 For any  s, m ∈ Nsuch that  s ≤ m/e, consider a uniform sparse JL distribution (defined in Section 1.1) with sparsity s and dimension  m.4If  ϵand  δare small enough5, the function  v(m, ϵ, δ, s)is equal

image

Our main result, Theorem 1.5, significantly generalizes Theorem 1.2, Theorem 1.3, and Theorem 1.4. Notice our bound in Theorem 1.5 has up to four regimes. In the first regime, which occurs when  m ≥min(2ϵ−2/δ, ϵ−2 ln(1/δ)eΘ(max(1,ln(1/δ)ϵ−1/s))), Theorem 1.5 shows  v(m, ϵ, δ, s) = 1, so (1) holds on the full space  Rn. Notice this boundary on m occurs at the dimensionality-sparsity tradeoff in Theorem 1.3. In the last regime, which occurs when  m ≤ Θ(ϵ−2 ln(1/δ)), Theorem 1.5 shows that  v(m, ϵ, δ, s) = 0, so there are vectors with arbitrarily small  ℓ∞-to-ℓ2norm ratio where (1) does not hold. When  s ≤ Θ(ϵ−1 ln(1/δ)), Theorem 1.5 shows that up to two intermediate regimes exist. One of the regimes,  Θ(√ϵs min(ln( mϵp )/p,�ln( mϵ2p )/√p)), matches the middle regime of  v(m, ϵ, δ, 1)in Theorem 1.4 with an extra factor of √s, much like the bound for multiple hashing in [29] that we mentioned previously. However, unlike the multiple hashing bound, Theorem 1.5 sometimes has another regime,  Θ(√ϵs�ln( mϵ2p )/√p), which does not arise for s = 1 (i.e. in Theorem 1.4).7Intuitively, we expect this additional regime for sparse JL with s close to  Θ(ϵ−1 ln(1/δ)): at s = Θ(ϵ−1 ln(1/δ))and  m = Θ(ϵ−2 ln(1/δ)), Theorem 1.2 tells us  v(m, ϵ, δ, s) = 1, but if  ϵis a constant, then the branch  Θ(√ϵs ln�mϵp�/p)yields  Θ(1/�ln(1/δ)), while the branch  Θ(√ϵs�ln( mϵ2p )/√p)yields  Θ(1). Thus, it is natural that the first branch disappears for large m.

Our result elucidates that  v(m, ϵ, δ, s)increases approximately as √s, thus providing insight into how even small constant increases in sparsity can be useful in practice. Another consequence of our result is a lower bound on dimension-sparsity tradeoffs (Corollary A.1 in Appendix A) that essentially matches the upper bound in Theorem 1.3. Moreover, we require new techniques to prove Theorem 1.5, for reasons that we discuss further in Section 1.2.

We also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with real-world datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s = {4, 8, 16} consistently outperforms s = 1 in preserving the  ℓ2norm of each vector, and that there can be up to a factor of ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. More specifically, we empirically show that  v(m, ϵ, δ, s)is not smooth, and that the middle regime(s) of  v(m, ϵ, δ, s)increases with s.

1.1 Preliminaries

Let  As,m,nbe a sparse JL distribution if the entries of a matrix  A ∈ As,m,nare generated as follows. Let Ar,i = ηr,iσr,i/√swhere  {σr,i}r∈[m],i∈[n]and  {ηr,i}r∈[m],i∈[n]are defined as follows:

The families  {σr,i}r∈[m],i∈[n]and  {ηr,i}r∈[m],i∈[n]are independent from each other.

The variables  {σr,i}r∈[m],i∈[n]are i.i.d Rademachers (±1coin flips).

The variables  {ηr,i}r∈[m],i∈[n]are identically distributed Bernoullis ({0, 1} random variables) with expectation s/m.

The  {ηr,i}r∈[m],i∈[n]are independent across columns but not independent within each column. For every column  1 ≤ i ≤ n, it holds that �mr=1 ηr,i = s. Moreover, the random variables are negatively correlated: for every subset  S ⊆ [m]and every column  1 ≤ i ≤ n, it holds that  E��r∈S ηr,i�≤ �r∈S E[ηr,i].

A common special case is a uniform sparse JL distribution, generated as follows: for every  1 ≤ i ≤ n, we uniformly choose exactly s of these variables in  {ηr,i}r∈[m]to be 1. When s = 1, every sparse JL distribution is a uniform sparse JL distribution, but for s > 1, this is not the case.

Another common special case is a block sparse JL distribution. This produces a different construction for s > 1. In this distribution, each column  1 ≤ i ≤ nis partitioned into s blocks of� ms�consecutive rows. In each block in each column, the distribution of the variables  {ηr,i}is defined by uniformly choosing exactly one of these variables to be  1.8

1.2 Proof Techniques

We use the following notation. For any random variable X and value  q ≥ 1, we call  E[|X|q]the qth moment of X, where E denotes the expectation. We use  ∥X∥qto denote the q-norm  (E[|X|q])1/q.

For every  [x1, . . . , xn] ∈ Rnsuch that  ∥x∥2 = 1, we need to analyze tail bounds of an error term, which for the sparse JL construction is the following random variable:

image

An upper bound on the tail probability of  R(x1, . . . , xn)is needed to prove the lower bound on  v(m, ϵ, δ, s)in Theorem 1.5, and a lower bound is needed to prove the upper bound on  v(m, ϵ, δ, s)in Theorem 1.5. It turns out that it suffices to tightly analyze the random variable moments  E[(R(x1, . . . , xn))q]. For the upper bound, we use Markov’s inequality like in [13, 19, 3, 24], and for the lower bound, we use the Paley-Zygmund inequality like in [13]: Markov’s inequality gives a tail upper bound from upper bounds on moments, and the Paley-Zygmund inequality gives a tail lower bound from upper and lower bounds on moments. Thus, the key ingredient of our analysis is a tight bound for  ∥R(x1, . . . , xn)∥qon  Sv =�x ∈ Rn | ∥x∥∞∥x∥2 ≤ v�at eachthreshold v value.

While the moments of  R(x1, . . . , xn)have been studied in previous analyses of sparse JL, we emphasize that it is not clear how to adapt these existing approaches to obtain a tight bound on every  Sv. The moment bound that we require and obtain is far more general: the bounds in [19, 9] are limited to  Rn = S1and the bound in [13] is limited to  s = 1.9The non-combinatorial approach in [9] for bounding  ∥R(x1, . . . , xn)∥qon Rn = S1also turns out to not be sufficiently precise on  Sv, for reasons we discuss in Section 2.10

Thus, we require new tools for our moment bound. Our analysis provides a new perspective, inspired by the probability theory literature, that differs from the existing approaches in the JL literature. We believe our style of analysis is less brittle than combinatorial approaches [13, 19, 3, 24]: in this setting, once the sparsity s = 1 case is recovered, it becomes straightforward to generalize to other s values. Moreover, our approach can yield greater precision than the existing non-combinatorial approaches [9, 8, 14], which is necessary for this setting. Thus, we believe that our structural approach to analyzing JL distributions could be of use in other settings.

In Section 2, we present an overview of our methods and the key technical lemmas to analyze  ∥R(x1, . . . , xn)∥q. We defer the proofs to the Appendix. In Section 3, we prove the tail bounds in Theorem 1.5 from these moment bounds. In Section 4, we empirically evaluate sparse JL.

Our approach takes advantage of the structure of  R(x1, . . . , xn)as a quadratic form of Rademachers (i.e. �t1,t2 at1,t2σt1σt2) with random variable coefficients (i.e. where  at1,t2is itself a random variable). For the upper bound, we need to analyze  ∥R(x1, . . . , xn)∥qfor general vectors  [x1, . . . , xn]. For the lower bound, we only need to show  ∥R(x1, . . . , xn)∥qis large for single vector in each  Sv, and we show we can select the vector in the  ℓ2-unit ball with  1/v2nonzero entries, all equal to v. For ease of notation, we denote this vector by [v, . . . , v, 0, . . . , 0] for the remainder of the paper.

We analyze  ∥R(x1, . . . , xn)∥qusing general moment bounds for Rademacher linear and quadratic forms. Though Cohen, Jayram, and Nelson [9] also view  R(x1, . . . , xn)as a quadratic form, we show in Appendix B that their approach of bounding the Rademachers by gaussians is not sufficiently precise for our setting.11

In our approach, we make use of stronger moment bounds for Rademacher linear and quadratic forms, some of which are known to the probability theory community through Latała’s work in [21, 20] and some of which are new adaptions tailored to the constraints arising in our setting. More specifically, Latała’s bounds [21, 20] target the setting where the coefficients are scalars. In our setting, however, the coefficients are themselves random variables, and we need bounds that are tractable to analyze in this setting, which involves creating new bounds to handle some cases.

Our strategy for bounding  ∥R(x1, . . . , xn)∥qis to break down into rows. We define

image

so that  R(x1, . . . , xn) = 1s�mr=1 Zr(x1, . . . , xn). We analyze the moments of  Zr(x1, . . . , xn), and then combine these bounds to obtain moment bounds for  R(x1, . . . , xn). In our bounds, we use the notation  f ≲ g(resp. f ≳ g) to denote  f ≤ Cg(resp.  f ≥ Cg) for some constant C > 0.

2.1 Bounding Zr(x1, . . . , xn)q

We show the following bounds on  ∥Zr(x1, . . . , xn)∥q. For the lower bound, as we discussed before, it suffices to bound  ∥Zr(v, . . . , v, 0, . . . , 0)∥q. For the upper bound, we need to bound  ∥Zr(x1, . . . , xn)∥qfor general vectors as a function of the  ℓ∞-to-ℓ2norm ratio.

Lemma 2.1 Let  As,m,nbe a sparse JL distribution such that  s ≤ m/e. Suppose that  x = [x1, . . . , xn]satisfies ∥x∥∞ ≤ vand  ∥x∥2 = 1. If T is even, then:

image

Lemma 2.2 Let  As,m,nbe a sparse JL distribution. Suppose 1v2and T are even integers. Then,  ∥Zr(v, . . . , v, 0, . . . , 0)∥2 ≳ sm. Moreover, if  s ≤ m/eand  T ≥ semv2, then

image

and

image

We now sketch our methods to prove Lemma 2.1 and Lemma 2.2. For the lower bound (Lemma 2.2), we can view  Zr(v, . . . , v, 0, . . . , 0)as a quadratic form �t1,t2 at1,t2σt1σt2where  (at1,t2)t1,t2∈[mn]is an ap- propriately defined block-diagonal mn dimensional matrix. We can write  Eσ,η[(Zr(v, . . . , v, 0, . . . , 0))q]as Eη [Eσ[(Zr(v, . . . , v, 0, . . . , 0))q]]: for  fixed ηr,ivalues, the coefficients are scalars. We make use of Latała’s tight bound on Rademacher quadratic forms with scalar coefficients [21] to analyze  Eσ[(Zr(v, . . . , v, 0, . . . , 0))q]as a function of the  ηr,i. Then, we handle the randomness of the  ηr,iby taking an expectation of the resulting bound on  Eσ[(Zr(v, . . . , v, 0, . . . , 0))q]over the  ηr,ivalues to obtain a bound on  ∥Zr(v, . . . , v, 0, . . . , 0)∥q.

For the upper bound (Lemma 2.1), since Latała’s bound [21] is tight for scalar quadratic forms, the natural approach would be to use it to upper bound  Eσ[(Zr(x1, . . . , xn))q]for general vectors. However, when the vector is not of the form [v, . . . , v, 0, . . . , 0], the asymmetry makes the resulting bound intractable to simplify. Specifically, there is a term, which can be viewed as a generalization of an operator norm to an  ℓ2ball cut out by  ℓ∞hyperplanes, that becomes problematic when taking an expectation over the  ηr,ito obtain a bound on Eσ,η[(Zr(x1, . . . , xn))q]. Thus, we construct simpler estimates that avoid these complications while remaining sufficiently precise for our setting. These estimates take advantage of the structure of  Zr(x1, . . . , xn)and enable us to show Lemma 2.1.

2.2 Obtaining bounds on R(x1, . . . , xn)q

Now, we use Lemma 2.1 and Lemma 2.2 to show the following bounds on  ∥R(x1, . . . , xn)∥q:

Lemma 2.3 Suppose  As,m,nis a sparse JL distribution such that  s ≤ m/e, and let  x = [x1, . . . , xn]be such that  ∥x∥2 = 1. Then,  ∥R(x1, . . . , xn)∥2 ≤

∥x∥∞ ≤ v. If semv2 ≥ q, then  ∥R(x1, . . . , xn)∥q ≲ √q√m. If semv2 < qand if there exists a constant  C2 ≥ 1such that  C2q3mv4 ≥ s2, then  ∥R(x1, . . . , xn)∥q ≲ gwhere g is:

image

Lemma 2.4 Suppose  As,m,nis a uniform sparse JL distribution. Let q be a power of 2, and suppose that 0 < v ≤ 0.5and 1v2is an even integer. If  qv2 ≤ s, then  ∥R(v, . . . , v, 0, . . . , 0)∥q ≳ √q√m. If  m ≥ q, 2 ≤ln(qmv4/s2) ≤ q, 2qv2 ≤ 0.5s ln(qmv4/s2), and  s ≤ m/e, then  ∥R(v, . . . , v, 0, . . . , 0)∥q ≳ qv2s ln(qmv4/s2). If

image

We now sketch how to prove bounds on  ∥R(x1, . . . , xn)∥qusing bounds on  ∥Zr(x1, . . . , xn)∥T. To show Lemma 2.3, we show that making the row terms  Zr(x1, . . . , xn)independent does not decrease  ∥R(x1, . . . , xn)∥q, and then we apply a general result from [20] for moments of sums of i.i.d symmetric random variables. For Lemma 2.4, handling the correlations between the row terms  Zr(x1, . . . , xn)requires more care. We show that the negative correlations induced by having exactly s nonzero entries per column do not lead to significant loss, and then stitch together  ∥R(v, . . . , v, 0, . . . , 0)∥qusing the moments of  Zr(v, . . . , v, 0, . . . , 0)that contribute the most.

We now sketch how to prove Theorem 1.5, using Lemma 2.3 and Lemma 2.4. First, we simplify these bounds at the target parameters to obtain the following:

Lemma 3.1 Let  As,m,nbe a sparse JL distribution, and suppose  ϵand  δare small enough,  s ≤ m/e, Θ(ϵ−2 ln(1/δ)) ≤ m < 2ϵ−2/δ, v ≤ f ′(m, ϵ, ln(1/δ), s), and  p = Θ(ln(1/δ))is even. If  x = [x1, . . . , xn]satisfies  ∥x∥∞ ≤ vand  ∥x∥2 = 1, then  ∥R(x1, . . . , xn)∥p ≤ ϵ2.

Lemma 3.2 There is a universal constant D satisfying the following property. Let  As,m,nbe a uniform sparse JL distribution, and suppose  ϵ, δare small enough,  s ≤ m/e, f ′(m, ϵ, ln(1/δ), s) ≤ 0.5, and q is an even integer such that  q = min(m/2, Θ(ln(1/δ))). For each  ψ > 0, there exists  v ≤ f ′(m, ϵ, ln(1/δ), s) + ψ, such that ∥R(v, . . . , v, 0, . . . , 0)∥q ≥ 2ϵand ∥R(v,...,v,0,...,0)∥q∥R(v,...,v,0,...,0)∥2q ≥ D.

Now, we use Lemma 3.1 and Lemma 3.2 to prove Theorem 1.5.

Proof of Theorem 1.5. Since the maps in  As,m,nare linear, it suffices to consider unit vectors x. First, we prove the lower bound on  v(m, ϵ, δ, s). To handle  m ≥ 2ϵ−2/δ, we take q = 2 in Lemma 3.1 and apply Chebyshev’s inequality. Otherwise, we take  p = ln(1/δ)(approximately) and apply Lemma 3.1 and Markov’s inequality. We see that  P[| ∥Ax∥22 − 1| ≥ ϵ]can be expressed as:

image

Thus, condition (1) is satisfied for  x ∈ Svwhen  v ≤ f ′(m, ϵ, ln(1/δ), s)as desired.

Now, we prove the upper bound on  v(m, ϵ, δ, s). We need to lower bound the tail probability of R(v, . . . , v, 0, . . . , 0), and to do this, we use the Paley-Zygmund inequality applied to qth moments. Let D be defined as in Lemma 3.2, and take  q = min(m/2, max(2, ln(1/δ)−2−2 ln(D) )). By the Paley-Zygmund inequality and Lemma 3.2, there exists  v ≤ f ′(m, ϵ, ln(1/δ), s) + ψsuch that:

image

Thus, it follows that  supx∈Sf′(m,ϵ,ln(1/δ),s)+ψ,∥x∥2=1 P[| ∥Ax∥22 − 1| > ϵ] > δas desired.

Recall that for sparse JL distributions with sparsity s, the projection time for an input vector x is  O(s ∥x∥0), where  ∥x∥0is the number of nonzero entries in x. Since this grows linearly in s, in order to minimize the impact on projection time, we restrict to small constant s values (i.e.  1 ≤ s ≤ 16). In Section 4.1, we demonstrate on real-world data the benefits of using s > 1. In Section 4.2, we illustrate trends in our theoretical bounds on synthetic data. Additional graphs can be found in Appendix I. For all experiments, we use a block sparse JL distribution to demonstrate that our theoretical upper bounds also empirically generalize to non-uniform sparse JL distributions.

4.1 Real-World Datasets

We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron).12Both datasets were pre-processed with the standard tf-idf preprocessing. In this experiment, we evaluated how well sparse JL preserves the  ℓ2norms of the vectors in the dataset. An interesting direction for future work would be to empirically evaluate how well sparse JL preserves other aspects of the geometry of real-world data sets, such as the  ℓ2distances between pairs of vectors.

In our experiment, we estimated the failure probability ˆδ(s, m, ϵ)for each dataset as follows. Let D be the number of vectors in the dataset, and let n be the dimension (n = 101631, D = 11314 for News20; n = 28102, D = 39861 for Enron). We drew a matrix  M ∼ As,m,nfrom a block sparse JL distribution. Then, we computed ∥Mx∥2∥x∥2for each vector x in the dataset, and used these values to compute an estimate ˆδ(s, m, ϵ) = number of vectors x such that∥Mx∥2∥x∥2 ̸∈1±ϵD. We ran 100 trials to produce 100 estimates ˆδ(s, m, ϵ).

image

Figure 1: News20: ˆδ(m, s, 0.07)v. s

Figure 2: Enron: ˆδ(m, s, 0.07)vs. s

Figure 1 and Figure 2 show the mean and error bars (3 standard errors of the mean) of ˆδ(s, m, ϵ)at  ϵ = 0.07. We consider  s ∈ {1, 2, 4, 8, 16}, and choose m values so that  0.01 ≤ ˆδ(1, m, ϵ) ≤ 0.04.

All of the plots show that  s ∈ {2, 4, 8, 16}achieves a lower failure probability than s = 1, with the differences most pronounced when m is larger. In fact, at m = 1000, there is a factor of four decrease in  δbetween s = 1 and s = 4, and a factor of ten decrease between s = 1 and s = 8, 16. We note that in plots in the Appendix, there is a slight increase between s = 8 and s = 16 at some  ϵ, δ, mvalues (see Appendix I for a discussion of this non-monotonicity in s); however s > 1 still consistently beats s = 1. Thus, these findings demonstrate the potential benefits of using small constants s > 1 in sparse JL in practice, which aligns with our theoretical results.

4.2 Synthetic Datasets

We used synthetic data to illustrate the phase transitions in our bounds on  v(m, ϵ, δ, s)in Theorem 1.5 for a block sparse JL distribution. For several choices of  s, m, ϵ, δ, we computed an estimate  ˆv(m, ϵ, δ, s)of  v(m, ϵ, δ, s)as follows. Our experiment borrowed aspects of the experimental design in [13]. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1}). The binary vectors were defined by a set W of values exponentially spread between 0.03 and  113: for each  w ∈ W, we constructed a binary vector  xwwhere the first  1/w2entries are nonzero, and computed an estimate ˆδ(s, m, ϵ, w)of the failure probability of the block sparse JL distribution on the specific vector  xw(i.e.  PA∈As,m,1/w2 [∥Axw∥2 ̸∈ (1 ± ϵ) ∥xw∥2]). We computed each ˆδ(s, m, ϵ, w)using 100,000 samples from a block sparse JL distribution, as follows. In each sample, we independently drew a matrix  M ∼ As,m,1/w2and computed the ratio ∥Mxw∥2∥xw∥2. Then, we took ˆδ(s, m, ϵ, w) := (number of samples where ∥Mxw∥2∥xw∥2 ̸∈ 1 ± ϵ)/T. Finally, we used the estimates ˆδ(s, m, ϵ, w)to obtain the estimate  ˆv(m, ϵ, δ, s) = max�v ∈ W | ˆδ(s, m, ϵ, w) < δfor all  w ∈ Wwhere  w ≤ v�.

image

Figure 3: Phase transitions of  ˆv(m, 0.1, 0.01, s)

Figure 4: Phase transitions of  ˆv(m, 0.05, 0.05, s)

Figure 3 and Figure 4 show  ˆv(m, ϵ, δ, s)as a function of dimension m for  s ∈ {1, 2, 3, 4, 8}for two settings of  ϵand  δ. The error-bars are based on the distance to the next highest v value in W.

Our first observation is that for each set of  s, ϵ, δvalues considered, the curve  ˆv(m, ϵ, δ, s)has “sharp” changes as a function of m. More specifically,  ˆv(m, ϵ, δ, s)is 0 at small m, then there is a phase transition to a nonzero value, then an increase to a higher value, then an interval where the value appears “flat”, and lastly a second phase transition to 1. The first phase transition is shared between s values, but the second phase transition occurs at different dimensions m (but is within a factor of 3 between s values). Here, the first phase transition likely corresponds to  Θ(ϵ−2 ln(1/δ))and the second phase transition likely corresponds to  min�ϵ−2eΘ(ln(1/δ)), ϵ−2 ln(1/δ)eΘ(ln(1/δ)ϵ−1/s)�.

Our second observation is that as s increases, the “flat” part occurs at a higher y-coordinate. Here, the increase in the “flat” y-coordinate as a function of s corresponds to the √sterm in  v(m, ϵ, δ, s). Technically, according to Theorem 1.5, the “flat” parts should be increasing in m at a slow rate: the empirical “flatness” likely arises since W is a finite set in the experiments.

Our third observation is that s > 1 generally outperforms s = 1 as Theorem 1.5 suggests: that is, s > 1 generally attains a higher  ˆv(m, ϵ, δ, s)value than s = 1. We note at large m values (where  ˆv(m, ϵ, δ, s)is close to 1), lower s settings sometimes attain a higher  ˆv(m, ϵ, δ, s)than higher s settings (e.g. the second phase transition doesn’t quite occur in decreasing order of s in Figure 3): see Appendix I for a discussion of this non-monotonicity in  s.15Nonetheless, in practice, it’s unlikely to select such a large dimension m, since the ℓ∞-to-ℓ2guarantees of smaller m are likely sufficient. Hence, a greater sparsity generally leads to a better ˆv(m, ϵ, δ, s)value, thus aligning with our theoretical findings.

[1] The 20 newsgroups text dataset. https://scikit-learn.org/0.19/datasets/twenty_newsgroups. html.

[2] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[3] Z. Allen-Zhu, R. Gelashvili, S. Micali, and N. Shavit. Sparse sign-consistent Johnson–Lindenstrauss matrices: Compression with neuroscience-based constraints. In Proceedings of the National Academy of Sciences (PNAS), volume 111, pages 16872–16876, 2014.

[4] Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13(3):291–314, Jun 2010.

[5] C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1), 2012.

[6] C. Chen, C. Vong, C. Wong, W. Wang, and P. Wong. Efficient extreme learning machine via very sparse random projection. Soft Computing, 22, 03 2018.

[7] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. Proceedings of the 32nd Annual International Conference on Machine Learning (ICML), pages 2285–2294, 2015.

[8] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 278–287, 2016.

[9] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In Proceedings of the 1st Symposium on Simplicity in Algorithms (SOSA), pages 1–9, 2018.

[10] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimen- sionality reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pages 6618–6628, 2017.

[11] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.

[12] A. Dasgupta, R. Kumar, and T. Sarlos. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.

[13] C. Freksen, L. Kamma, and K. G. Larsen. Fully understanding the hashing trick. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 5394–5404, 2018.

[14] M. Jagadeesan. Simple analysis of sparse, sign-consistent JL. In Proceedings of the 23rd International Conference and 24th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 61:1–61:20, 2019.

[15] T.S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and steaming problems with subconstant error. In ACM Transactions on Algorithms (TALG) - Special Issue on SODA’11, volume 9, pages 1–26, 2013.

[16] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[17] D. M. Kane, R. Meka, and J. Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 14th International Workshop and 15th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 628–639, 2011.

[18] D. M. Kane and J. Nelson. A derandomized sparse Johnson-Lindenstrauss transform. CoRR, abs/1006.3585, 2010.

[19] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 16872–16876. ACM Press, 2012.

[20] R. Latała. Estimation of moments of sums of independent real random variables. Annals of Probability, 25(3):1502–1513, 1997.

[21] R. Latała. Tail and moment estimates for some types of chaos. Studia Mathematica, 135(1):39–53, 1999.

[22] P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 287–296, 2006.

[23] C. Ma, J. Jung, S. Kim, and S. Ko. Random projection-based partial feature extraction for robust face recognition. Neurocomputing, 149:1232 – 1244, 2015.

[24] J. Nelson and H.L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126, 2013.

[25] J. Nelson and H.L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 101–110, 2013.

[26] D. Newman. Bag of words data set. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, 2008.

[27] H. Song. Robust visual tracking via online informative feature selection. Electronics Letters, 50(25):1931– 1932, 2014.

[28] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, volume 36 of Integrated Series in Information Systems. Springer US, Boston, MA, 2016.

[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113–1120, 2009.

In Appendix A, we prove our corollary regarding dimension-sparsity tradeoffs and discuss some of the subtleties of Theorem 1.5. In Appendix B, we show that the Hanson-Wright bound is too loose to prove Theorem 1.5. In Appendix C, we state and prove useful moment bounds that we use throughout the analysis. In Appendix D, we prove our moment bounds for  Zr(x1, . . . , xn)in Lemma 2.1 and Lemma 2.2. In Appendix E, we prove our moment bounds for  R(x1, . . . , xn)in Lemma 2.3 and Lemma 2.4. In Appendix F, we prove auxiliary lemmas needed in the proof of Lemma 2.3. In Appendix G, we prove auxiliary lemmas needed in the proof of Lemma 2.4. In Appendix H, we prove our simplified moment bounds for  R(x1, . . . , xn)in Lemma 3.1 and Lemma 3.2. In Appendix I, we provide additional experimental results on real-world and synthetic datasets as well as additional discussion.

We discuss some of the subtleties of Theorem 1.5. When  m ≥ min(2ϵ−2ep, ϵ−2peΘ(max(1,pϵ−1/s))), where p = ln(1/δ), we show that  v(m, ϵ, δ, s) = 1, which means that the norm-preserving condition holds on the full space. This generalizes Cohen’s bound [8] to a slightly more general family of sparse JL distributions, as we discuss below. When  m ≤ Θ(ϵ−2 ln(1/δ)), we show that  v(m, ϵ, δ, s) = 0. For the remaining regimes, √ϵs�ln( mϵ2p )/√pand  √ϵs min�ln( mϵp )/p,�ln( mϵ2p )/√p�, our upper and lower bounds on  v(m, ϵ, δ, s)match up to constant factors.

In terms of the boundaries between regimes, we emphasize that in Theorem 1.5, the function  f ′(m, ϵ, δ, s)may not be defined for certain intervals between the boundaries of regimes, since there may be different absolute constants in different boundaries. More specifically, these intervals are  C1ϵ−2p ≤ m ≤ C2ϵ−2p, ϵ−2eC1p ≤ m ≤ 2ϵ−2ep, and  s · eC1 max(1,pϵ−1/s) ≤ m ≤ s · eC2 max(1,pϵ−1/s). These gaps arise because the boundaries between the regimes on our upper and lower bounds on  v(m, ϵ, δ, s)can have different absolute constants, so we don’t have precise control on  v(m, ϵ, δ, s)in these gaps. Nonetheless, the gaps only span a constant factor range on the exponent in the dimension m.

We now state the dimension-sparsity tradeoffs that follow from our bounds:

Corollary A.1 Suppose that  ϵand  δare sufficiently small and  s ≤ m/e. If  As,m,nis any sparse JL distribution, then  v(m, ϵ, δ, s) = 1when  m ≥ min�2ϵ−2/δ, ϵ−2 ln(1/δ)eΘ(max(1,ln(1/δ)ϵ−1/s))�. If  As,m,nis a uniform sparse

JL distribution, then  v(m, ϵ, δ, s) ≤ 1/2when  m ≤ min�ϵ−2eΘ(ln(1/δ)), ϵ−2 ln(1/δ)eΘ(max(1,ln(1/δ)ϵ−1/s))�, apart from a constant-factor interval  C1ϵ−2 ln(1/δ) ≤ m ≤ C2ϵ−2 ln(1/δ)where we do not have a bound on the behavior of sparse JL.

Proof of Corollary A.1. The first statement follows from the fact the lower bound in Theorem 1.5 holds for any sparse JL distribution. For the upper bound, we also use Theorem 1.5. Let’s set  Cv√ϵs√ln(mϵ2√p = 12, where  Cv

is the implicit constant in the upper bound. This solves to  m = ϵ−2pe

We also have the condition that  m ≤ ϵ−2eΘ(ln(1/δ))for this regime to be reached. We can obtain the max with 1 on the exponent, by using that  v(m, ϵ, δ, s) = 0when  m ≤ Θ(ϵ−2 ln(1/δ)). To avoid having a gap when m = s·eΘ(max(1,ln(1/δ)ϵ−1/s)), we implicitly use that our lower bound actually doesn’t have a gap between these regimes (though there may be a gap in the boundary between the lower bound and upper bound). Thus, we only have to keep the gap  C1ϵ−2 ln(1/δ) ≤ m ≤ C2ϵ−2 ln(1/δ)where we do not have a lower bound.

Notice that the upper and lower bounds in Corollary A.1 also match up to constant factors on the exponent in the dimension m.

Though Cohen, Jayram, and Nelson [9] also view  R(x1, . . . , xn)as a quadratic form, we show that their approach is not sufficiently precise for our setting. They upper bound the moments of  R(x1, . . . , xn)by the

gaussian case through considering:

image

where the  gr,iare i.i.d standard gaussians. They use the fact Rademachers are subgaussian to conclude that ∥R(x1, . . . , xn)∥q ≤��� ˜R(x1, . . . , xn)���q. In order to obtain upper bounds on��� ˜R(x1, . . . , xn)���q, they use the Hanson-Wright bound, a tight bound on moments of gaussian quadratic forms. However, we need different technical tools for two reasons.

1. First, in order to upper bound  v(m, ϵ, δ, s), we need to  lower bound ∥|R(x1, . . . , xn)∥q, and thus cannot simply consider��� ˜R(x1, . . . , xn)���q.

2. Second, even to lower bound  v(m, ϵ, δ, s), using��� ˜R(x1, . . . , xn)���qas a upper bound for  ∥R(x1, . . . , xn)∥qis not sufficiently strong. Below, we give a counter-example, i.e. a vector x, where��� ˜R(x1, . . . , xn)���qis too large to recover a tight lower bound.

Thus, we cannot use the Hanson-Wright bound in this setting, and need to come up with a better bound on ∥R(x1, . . . , xn)∥qthat does not implicitly replace Rademachers by gaussians. The second point is similar in flavor to the conceptual point made in [14], where a sign-consistent variant of sparse JL was analyzed using an upper bound for Rademacher quadratic forms. However, the bound in [14] also turns out to be loose in this setting and also can’t be used to obtain either a sufficiently tight upper bound or a lower bound for R(x1, . . . , xn).

We now show point (2): that the Hanson-Wright bound is not sufficiently strong to obtain a lower bound v(m, ϵ, δ, s). We consider ˜R(x1, . . . , xn) = 1s�mr=1�i̸=j ηr,iηr,jgr,igr,jxixj, as above, where the  gr,iare i.i.d standard gaussians. We consider p equal to  ln(1/δ)rounded up to the nearest even integer, and we consider a vector of the form [v, . . . , v, 0, . . . , 0] where 1v2is an integer and  v ≥ 0. We show��� ˜R(v, . . . , v, 0, . . . , 0)���p ≳ ω(ϵ)for a certain v value, where we know it to be true that  ∥R(v, . . . , v, 0, . . . , 0)∥p ≲ ϵ.

Let’s consider a vector [v, . . . , v, 0, . . . , 0] where 1v2is an integer and  v ≥ 0. We apply the Hanson-Wright bound (which is tight for gaussians) to obtain:

image

Let  M = �Ni=1 η1,i. Let  S ⊆ [N]be the set of indices where  η1,i = 1. We can set the vector to  xi = yi = 1√Mfor all  i ∈ Sand 0 elsewhere. This gives us:

image

We can expand out this moment to obtain:

image

Since  M ≤ p, we know that� pM�M ≥ 1. Moreover, as long as  p ≥ semv2, we know that�1 − sm�M/p ≥�1 − sm�N/p ≥�1 − sm� ms ≥ 0.3. Thus we obtain a bound of

image

We show that when s = 1, the bound  v = √ϵln( mϵp )pwill produce  ∥R(v, . . . , v, 0, . . . , 0)∥p ≳ ω(ϵ). At this v value, we know that:

image

The key quadratic form bound for Rademachers that we use is:

Lemma C.1 Let T be an even integer,  {σi}1≤i≤nbe independent Rademachers, and  (Yi,j)1≤i,j≤nbe a  n × nsymmetric, nonnegative random matrix with zero diagonal (i.e.  Yi,i = 0) such that  {Yi,j}1≤i,j≤nis independent from  {σi}1≤i≤n. If  Wi =��1≤j≤n Y 2i,j, then:

image

where W(1) ≥ W(2) ≥. . .  ≥. . . W(n)is a permutation of W1, . . . , Wn.

We derive Lemma C.1 from Latała’s bound on Rademacher quadratic forms [21]. In fact, Latała shows moment bounds for much more general quadratic forms, but for the application to JL, we only need the following bound in the special case of Rademachers:

Lemma C.2 ([21]) Let T be an even natural number. Let  σ1, . . . , σnbe independent Rademachers and let (ai,j)a symmetric matrix with zero diagonal. Then:

image

where Ai =��1≤j≤na2i,jand A(1) ≥ A(2). . .  ≥. . . A(n)is a permutation of A1, . . . , An.

To prove Lemma C.1, we apply Lemma C.2 to the case where the  ai,jare themselves random variables:

image

where the last line follows from the fact that the the  Yi,jare nonnegative, so each term is nonnegative, so the triangle inequality results in at most a factor of 2 of gain.

Now, we consider linear forms of symmetric random variables. Theoretically, moments of these forms can be derived from Theorem 2 in [20] (a tight bound on moments of weighted sums of symmetric random variables). However, reducing the tight bound to the form that we want would require some simplifications. Instead, we give a direct proof of a weaker bound that is sufficiently tight for our setting.

Proposition C.3 Suppose that  T ≥ 1is an integer. Suppose that  Y1, Y2, . . . , Ynare i.i.d symmetric random variables and suppose that  x = [x1, . . . , xn]satisfies  ∥x∥2 ≤ 1and  ∥x∥∞ ≤ v. Then, we have that

image

Proof of Proposition C.3. Let  k = 2v�sup1≤t≤TTt� 1T v2�1/(2t) ∥Yi∥2t�. Observe that

image

Now, we use the fact that  |xi| ≤ vand the condition on k to obtain that this is bounded by

image

We now bound moments of squares of linear forms with a zero diagonal, i.e. �i̸=j YiYjxixj. This structure of random variable theoretically falls under the scope of Lemma C.1. However, as mentioned in Section 2.1, the first term of C.1, which is an operator-norm-like term for an asymmetric random matrix in this setting, becomes intractable to manage. We give an alternate (weaker) upper bound that is both tractable to analyze and sufficiently tight for our setting. Our proof of this bound is similar to our proof of Proposition C.3 presented above. Since random variables with a zero diagonal are common in the JL literature [19, 3, 24], we believe this moment bound could be of broader use.

Lemma C.4 Suppose that  Y1, Y2, . . . , Ynare i.i.d symmetric random variables and suppose that  x = [x1, . . . , xn]satisfies  ∥x∥2 = 1and  ∥x∥∞ ≤ v. Let T be an even natural number. Then, we have that

image

Proof of Lemma C.4. Let  k = 2v�sup1≤t≤T/2Tt� 1T v2�1/(2t) ∥Yi∥2t�. Observe that

image

Now, we use the fact that  |xi| ≤ vand the condition on k to obtain that this is bounded by

image

Latała [20] gives the following nice bound on sums of i.i.d symmetric random variables that we use for combining bounds on rows  Zr(x1, . . . , xn)in Lemma 2.3.

Lemma C.5 ([20]) Suppose that q is an even natural number. Suppose that  Y1, . . . , Ynare i.i.d symmetric random variables. Then:

image

We give a general lower bound on moments of certain (potentially correlated) sums of identically distributed random variables, that we use in proving Lemma 2.4.

Proposition C.6 Let  Y1, . . . , Ynbe identically distributed (but not necessarily independent) random variables, such that the joint distribution is a symmetric function of  Y1, . . . , Ynand for any integers  d1, . . . dn ≥ 0, it is true that  E[�1≤i≤n Y dii ] ≥ 0. For any natural number q and natural number T that divides q, it is true that

image

Proof of Proposition C.6. The proof follows from expanding  E[(�ni=1 Yi)q]and using the fact that  E[�1≤i≤n Y dii ] ≥0 so that we can restrict to a subset of the terms. By the symmetry of the joint distribution, we know that for 1 ≤ r1 ̸= r2 ̸= rT ≤ n, we know that  E[Y q/Tr1 . . . Y q/TrT ] = E[Y q/T1 . . . Y q/TT ]. The number of terms of the form E[Y q/Tr1 . . . Y q/TrT ]in  E[(�ni=1 Yi)q]is:

image

This implies that

image

and the statement follows from taking 1/qth powers.

We prove a lemma involving the Paley-Zygmund inequality applied to pth moments, that we use implicitly in the proof of the upper bound in Theorem 1.5.

Lemma C.7 Suppose that K > 0 and Z is a nonnegative random variable, such that  ∥Z∥q ≥ 2Kand  ∥Z∥2qis finite. Then,

image

We use the Paley-Zygmund inequality, which says the following:

Lemma C.8 (Paley-Zygmund) Suppose that Z is a nonnegative random variable with finite variance. Then,

image

Proof of Lemma C.7. We apply Lemma C.8 to  Zpto obtain that:

image

If  ∥Z∥p ≥ 2K, then we know that

image

and then we can apply the above result.

We analyze the moments of  Zr(x1, . . . , xn), proving Lemma 2.2 and Lemma 2.1. Our lower bound in Lemma 2.2 holds for  ∥Zr(v, . . . , v, 0, . . . , 0∥qas well as���Zr(v, . . . , v, 0, . . . , 0)I�1/v2i=1 ηr,i=2

���T(for technical reasons discussed in Appendix E). Our upper bound in Lemma 2.1 holds for  ∥Zr(x1, . . . , xn)∥q. In Appendix D.1, we prove Lemma 2.2. In Appendix D.2, we prove Lemma 2.1.

D.1 Proof of Lemma 2.2

The key ingredient of the proof is Lemma C.1 (for Rademacher quadratic forms). We can view  Zr(v, . . . , v, 0, . . . , 0)as the following quadratic form:

image

where  N = 1v2. Since the support of  ηr,iis {0, 1} and due to symmetry of this random variable, it is tractable to analyze the expressions in Lemma C.1.

Proof of Lemma 2.2. First, we handle the case of T = 2:

image

image

as desired.

image

is equal to

image

where the last line follows from the fact that since  T ≥ semv2and  s ≤ m/e, we know that:

image

Setting t = T/M, we obtain, up to constants:

image

We can take a derivative to obtain the two expressions in the lemma statement at the following regimes of parameters:  max(1, Tv2) ≤ ln(Tmv2/s) ≤ Tand  ln(Tmv2/s) > T. The second regime aligns with the lemma statement. Thus it suffices to show that when  v ≤

straightforward calculation16.

image

the above calculations, without taking the sum that we obtain a lower bound of

image

D.2 Proof of Lemma 2.1

In Section 2.1, we discussed the tractability issues with using the general quadratic form moment bound Lemma C.1 to upper bound  ∥Zr(x1, . . . , xn)∥q. Thus, we require simpler bounds that are easier to analyze. Linear forms naturally arise in the upper bound since  Zr(x1, . . . , xn) =��1≤i≤n ηr,iσr,ixi�2−�1≤i≤n ηr,ix2i ≤��1≤i≤n ηr,iσr,ixi�2. However, it turns out that a vanilla linear form bound (e.g. Proposition C.3) here is weak due to the loss arising from ignoring the �1≤i≤n ηr,ix2iterm. Thus, we use Lemma C.4 (our generalized bound tailored to squares of linear forms with a zero diagonal) to obtain:

Lemma D.1 If  ∥x∥∞ ≤ vand  ∥x∥2 ≤ 1, then we have that:

image

Proof. This can be seen by simply taking  Yi = ηr,iσr,iin Lemma C.4.

image

on  v(m, ϵ, δ, s)in Theorem 1.5. The lower bound on moments of  ∥Zr(v, . . . , v, 0, . . . , 0)∥Tin Lemma 2.2 sheds light on where this loss may be arising. We see that the problematic case is when  v ≥

we require a new bound for this regime. Since the vector  [v1, . . . , v1, 0, . . . , 0]is in  Svwhen  v1 ≤ v, we can’t hope to beat the bound of  ||Zr(v1, . . . , v1, 0, . . . , 0)||T ≳ T 2v21ln2(T mv21/s) ≃ Tln(m/s)from Lemma 2.2. We show that we can match this value:

Lemma D.2 Suppose that  x = [x1, . . . , xn]satisfies  ∥x∥2 = 1and  ∥x∥∞ < v. If  s ≤ m/e, T ≥ semv2 , T ≥ 3, T ≥ ln(mv2/s), then:

image

many smaller  |xi|that are still allowed to be present. We separate out  |xi| ≥

In the quadratic form formulation of  Zr(x1, . . . , xn), this separation cannot be carried out, since there would

16In fact, v =√ln(m/s)√Tis very close to the value where  Tv2 = ln(Tmv2/s), so this approximation is essentially tight.

image

be cross-terms between  |xi| ≥

image

Proof of Lemma D.2. WLOG, assume that  |x1| ≥ |x2| ≥ . . . ≥ |xn|. Let  P =� Tln(m/s)�. We know that

image

For  1 ≤ i ≤ P, we use the bound  | �pi=1 ηr,iσr,ixi| ≤ �pi=1 |xi| ≤�� Tln(m/s)�≤ 2�

terms, we take  Yi = ηr,iσr,iin Proposition C.3 to obtain the following upper bound17for  |xi| ≤ v′ :=√ln(m/s)√Tand  ∥x∥2 ≤ 1:

image

Based on the conditions in this lemma statement, we know that mT v′2s = mT ln(m/s)sT = ms ln(m/s) ≥ e. Thus taking a derivative, we obtain that this can be upper bounded by taking  t = ln(mTv′2/s)which yields:

image

Finally, combining Lemma D.1 and Lemma D.2 yields Lemma 2.1:

Proof of Lemma 2.1. We apply Lemma D.1 at T = 2 to directly obtain T sm, and for  T ≥ 3, we apply Lemma D.1 and take a derivative to obtain:

image

To obtain the desired bound, we also include the bound from Lemma D.2 in the middle regime.

Now, we show to move from bounds on moments of individual rows (i.e.  Zr(x1, . . . , xn)) to bounds on moments of  R(x1, . . . , xn). In Appendix E.1, we obtain an upper bound on  ∥R(x1, . . . , xn)∥q, thus proving Lemma 2.3. In Appendix E.2, we obtain a lower bound on  ∥R(x1, . . . , xn)∥q, thus proving Lemma 2.4.

E.1 Proof of Lemma 2.3

Since the  ηr,iare negatively correlated, we can always upper bound the moments of  R(x1, . . . , xn)by the case of a sum of independent random variables when q is even18 Z′r(x1, . . . , xn) ∼ Zr(x1, . . . , xn).

image

s  · ∥R(x1, . . . , xn)∥q ≤

image

where the last inequality follows from Lemma C.5. Thus, it remains to analyze the sup expression. It turns out that each regime of bounds in Lemma 2.1 collapses to one value, so the different regimes in Lemma 2.1 correspond to different parts of the max expressions in Lemma 2.3. Depending on the parameters, some of these regimes may not exist, as is reflected by branches of the max expression sometimes vanishing in Lemma 2.1. We defer the computation to Appendix F.

E.2 Proof of Lemma 2.4

Moving from a lower bound on the moments of individual rows given by Lemma 2.2 to moments of R(v, . . . , v, 0, . . . , 0) is more delicate. Unlike in the upper bound, the negative correlations between random variables require some care to handle, even with the simplification that the s nonzero entries in a column are chosen uniformly at random. For example, the conditional distribution of  ηs+1,1 | η1,1 = η2,1 = . . . = ηs,1 = 1is 0, while the marginal distribution of  ηs+1,1has expectation s/m. One aspect that simplifies our analysis is that we know from our proof of Lemma 2.3 which moments of  Zr(x1, . . . , xn)are critical in the sup expression in (2). We only need to account for these particular moments in our lower bound approach. It turns out that the three critical values are q/T = 2, q/T = q, and  q/T = ln(qmv4/s2).

For q/T = q, where rows are isolated, we can directly obtain a bound from Lemma C.6 and Lemma 2.2 to obtain.

Lemma E.1 Suppose  As,m,nis a uniform sparse JL distribution. Suppose that q is even,  s ≤ m/e, q ≥ semv2,

image

Proof. By Lemma C.6 with T = 1, we have that:

image

Now, we apply Lemma 2.2 to obtain the desired expression.

For q/T = 2 and  q/T = ln(qmv4/s2), we make use of the Lemma E.2 that relates moments of products of rows to products of moments of rows by taking advantage of either s and 1v2being sufficiently large. The method essentially uses a counting argument to show that not too many terms vanish as a result of negative correlations, and requires adding in an indicator for the number of nonzero entries in a row being 2 for some cases (which is sufficient to prove Lemma 2.4).

Lemma E.2 Suppose  As,m,nis a uniform sparse JL distribution. If  1 ≤ T ≤ q/2is an integer, q/T is an even integer, 1v2is an even integer, and  2Tv2 ≤ s, then:

image

We defer the proof to Appendix G. Now we can use Lemma C.6 coupled with Lemma E.2 and Lemma 2.2 to handle the cases of q/T = 2, ln(qmv4/s2)and obtain the following bounds. For q/T = 2, we obtain:

Lemma E.3 Suppose  As,m,nis a uniform sparse JL distribution. If q is an even integer, qv2s ≤ 1, and 1v2is an even integer, then it is true that:

image

Now, by Lemma 2.2, we can see that  ∥Z1(v, . . . , v, 0, . . . , 0)∥2 ≳ sm. Thus, our bound becomes:

image

For  q/T = ln(qmv4/s2), we similarly obtain the following bound using Lemma C.6 coupled with Lemma E.2.

Lemma E.4 Suppose  As,m,nis a uniform sparse JL distribution. Suppose that q is a power of  2, s ≤ m/e, 2qv2 ≤ 0.5s ln(qmv4/s2), 1v2is even,  2 ≤ ln(qmv4/s2) ≤ q, and  m ≥ q. Then it is true that:

image

Proof. Let’s let f(x) be the function that rounds x to the nearest power of 2. By the conditions, we know that  2 ≤ f(ln(qmv4/s2)) ≤ q. Now, we want the condition  2qv2 ≤ sf(ln(qmv4/s2))to be satisfied. If f(ln(qmv4/s2)) ≥ ln(qmv4/s2), then this is implied by  2qv2 ≤ s ln(qmv4/s2) = s max(ln(qmv4/s2), 2), which is a strictly weaker condition than the one given in the lemma statement. If  f(ln(qmv4/s2)) ≤ ln(qmv4/s2), then  f(ln(qmv4/s2)) ≥ 0.5 ln(qmv4/s2)and so  2qv2 ≤ 0.5s ln(qmv4/s2) ≤ sf(ln(qmv4/s2))gives the desired condition.

We use the fact that  ln(qmv4/s2)/2 ≤ f(ln(qmv4/s2)) ≤ 2 ln(qmv4/s2). We apply Lemma E.2 and Lemma C.6, with  T = qf(ln(qmv4/s2))and Lemma 2.2 to see that if we have the additional condition that f(ln(qmv4/s2)) ≥ semv2, then we know that:

image

With these bounds, Lemma 2.4 follows.

Proof of Lemma 2.4. We combine Lemma E.1, Lemma E.3, and Lemma E.4.

First, we use Lemma C.5 and Lemma 2.1 to prove a upper bound  ∥R(x1, . . . , xq)∥qthat is not quite in the desired form for Lemma 2.3.

Lemma F.1 Let  2 ≤ q ≤ mbe an even integer and  |xi| ≤ vand  ∥x∥2 = 1. If semv2 ≥ q, then:

image

If  ln(qmv2/s) > qthen we have

image

In all other cases, we have that

image

The functions are defined as follows.

image

Proof of Lemma F.1. As we discussed in Appendix E, it suffices to bound

image

Our bounds on  ∥Z1(x1, . . . , xn)∥tare based on Lemma 2.1. We split into cases based on the T value, and how it separates into different cases in Lemma 2.1. Let

image

Let’s first consider semv2 ≥ q. In this case, only the  β1branch arises. Now, suppose that semv2 < q. Suppose that  ln(qmv2/s) > q. Then we show that the  β34branch does not arise. It suffices to show that ln(Tmv2/s) > Tfor all  T ≥ semv2. Let  x = Tmv2/s. It suffices to show that smv2 xln x < 1for all  e ≤ x ≤ qmv2s.

Since smv2 xln x < 1at  x = qmv2sand this is an increasing function of x, we know that the condition is true. We now produce bounds  α1(q, v, s, m), . . . , α4(q, v, s, m)such that  βi(q, v, sm) ≲ αi(q, v, s, m), which is what we do for the remainder of the analysis.

First, we handle the  β1(q, v, s, m)term. We see that

image

Now, we handle the  β2(q, v, s, m)term. We obtain a bound for  ∥Zr∥T ≲ v2 � smT v2�2/T. The expression becomes:

image

Suppose that  ln(qmv4/s2) ≥ 2. In this case, we have that this expression is upper bounded by  T = ln(qmv4/s2). When we plug this into the expression, we obtain qv2s ln(qmv4/s2). Otherwise, if  ln(qmv4/s2) ≤ 2, then this expression is upper bounded by T = 3:

image

We know that that q2/3v2/3s1/3m1/3 ≤ √q√mbecause this reduces to

image

Now, we handle the  β4(q, v, s, m)term when  ln(qmv2/s) ≤ q.

image

If  s ≤ qv2, this is bounded by 1, and if  s ≥ qv2, this is bounded by� sqv2�1/ ln(mv2/s). We see that sqv2 ≤ mv2s,

image

Now, we handle the  β3(q, v, s, m)term. In this case, the expression becomes:

image

We use some function bounding arguments to come with a simpler bound for  α3for sufficiently large v.

Lemma F.2 Assume that  C2q3mv4 ≥ s2for some  C2 ≥ 1. Then it is true that

image

Proof of Lemma F.2. With the assumptions that we made we know that sq3v2C22 ≤ mv2s. This implies that our expression becomes:

image

Let  Tminbe the minimum T such that  T ≥ max( semv2 , 3, ln(mv2T/s)). We just need to bound

image

First, we handle the second term. Let  Q = mv2Ts. We use that  Tmin ≥ semv2, so  mv2Tmins ≥ eto conclude  Q ≥ e. We see that

image

We see that setting Q to its maximum value achieves within a factor of e of the maximum value of Qln2(Q). Thus, we obtain that this is upper bounded by  e3 qln2(mv2q/s).

Now, we just need to handle the first term. If  Tmin ≥ ln q, then this term doesn’t exist. Let’s take a log of the expression to obtain:

image

The sign of the derivative is the same as:

image

Since  Tmin ≥ semv2, we know that  ln(mv2T/s) ≥ 0. Thus, we know that  1 − 2ln(mv2T/s) ≤ 1. Since  T ≤ ln q, we know that ln qT ≥ 1, so  − 2 ln qT ≤ −2. Thus, the derivative is negative, so the sup is attained at  Tmin = T, where the expression is:

image

Thus, to upper bound by qln2(mv2q/s), it suffices to show:

image

Now, we combine Lemma F.1 and Lemma F.2 to prove Lemma 2.3.

Proof of Lemma 2.3. First, we compute the second moment by hand:

image

For  2 < q ≤ m, we apply Lemma F.1 and Lemma F.2. We only include  α4when  ln(qmv4/s2) ≥ 2to simplify the bound. The bound follows.

We prove Lemma E.2.

Proof of Lemma E.2. First, we show the following fact: Suppose that there are T distinguishable buckets and we want to a assign an ordered pair of 2 unequal elements in [N] to each bucket so that the total number of times that any element  i ∈ [N]shows up is  ≤ s. We show that the number of such assignments is at least CT N 2Tfor some constant C. To prove this, we first consider the case where  N ≥ 2T. In this case, we have that the number of such assignments is at least:

image

Now, if N < 2T, then we define:

image

We partition 2T into  βblocks, each of size N, until potentially the last block, which may be smaller. We can read off ordered pairs assigned to each bucket from this formulation. Let’s assume that each block is a permutation of 1, . . . , N, and the last block is  2T − (β − 1)(N)non-equal numbers drawn from 1, . . . , N. (this satisfies the unequal ordered pair condition). Then the number of assignments is  (N!)β−1 ·(N)(N −1) . . . (N −(2T − (β − 1)(N)) + 1). This is at least as big as  C2T1 N 2Tfor some constant  C1.

image

1 ≤ x ≤ s:

image

We know that

image

where  Yrhas expectation 0. In this case we have that

image

where Q consists of terms that contain a factor of some  Yr. Due to the independence of the Rademachers, the expectation of any term that contains a factor of  Yrhas expectation 0, which implies that:

image

We know that

image

where  Q′consists of terms that contain a factor of some  Y ′r. For similar reasons, this implies that

image

Let’s view �Tk=1 η′k,ikη′k,jkand  �Tk=1 ηk,ikηk,jkas terms in a sum. In the second expression, every term has expectation� sm�2T, and there are at most  N 2Tterms. In the first expression, if there are > s copies of any  ikvalue, then the expectation is 0. Otherwise, the expectation varies between  C−2T2 � sm�2Tand� sm�2T. By the counting argument at the beginning of the proof, we know that there are at least  C2T1 N 2Tterms. This implies that

image

as desired.

image

for  1 ≤ x ≤ s:

image

where  Yrhas expectation  ≥ 0. In this case we have that

image

where Q has expectation  ≥ 0. This implies that:

image

where  Q′consists of terms that contain a factor of some  Y ′r. For similar reasons to the above, we have that:

image

Let’s view �Tk=1 (ηk,ikηk,jk)q/Tand  �Tk=1�η′k,ikη′k,jkIM ′k=2�q/Tas terms in a sum. In the second expression, every term has expectation  ≤� sm�2T(the indicator can only reduce the expectation), and there are at most N 2Tterms. In the first expression, if there are > s copies of any  ikvalue, then the expectation is 0. Otherwise, the expectation varies between  C−2T2 � sm�2Tand� sm�2T. By the counting argument, we know that there are at least  C−2T1 N 2Tterms. This implies that

image

as desired.

Recall that our proof of Theorem 1.5 requires cleaner bounds on moments of  ∥R(x1, . . . , xn)∥qthat follow simplifying the bounds in Lemma 2.3 and Lemma 2.4 at the target values of v. The proofs of these lemmas boil down to function bounding and simplification.

H.1 Proof of Lemma 3.1

First, we show how Lemma 2.3 implies Lemma 3.1. The proof involves simplifying and bounding the function at the target v value.

Proof of Lemma 3.1. We plug q = p into Lemma 2.3. We use this relaxed version of the bound: If semv2 ≥ q, then  ∥R(x1, . . . , xn)∥q ≲ √q√m. Otherwise, if there exists  C2q3mv4 ≥ s2, then

image

Suppose that the absolute constant on the upper bounds is  ≤ C′. Let  C = max(C′, 1)(we take C to be the constant on the upper bounds). Let’s take  Cv,2 = 0.25√C,  Cv,1 = min� 0.1C3/2 , Cv,2�, CS = 4C, CM =

image

First, let’s analyze  v = f2. We show that  ln(pmf 42 /s2) ≥ 2. Observe that  ln(pmf 42 /s2) = ln(C4v,2 ln2(mϵ2/p))+

ln(mϵ2/p). Using the fact that  m ≥ e

image

Now, since  m ≥ e2ϵ−2p, this implies that

image

Moreover, we know that  p3mf 22 ≥ s2, since  pmv4 ≥ e2s2. Now, we show that  C pf 22s ln(pmf 42 /s2) ≤ 0.25ϵ. Let’s observe that

image

Now, we handle the case where  m ≥ s·e

using that  m ≥ se, this immediately follows from ps ln(m/s) ≤ ps ≤ 0.25ϵ. Otherwise, we need it to be true that s ln(m/s) ≥ 4Cpϵ−1. This can be written as  ln(m/s) ≥ 4Cpϵ−1s. Since  CS = 4C, this can be written as:

m  ≥ s ·e

v = f2:

image

only exist if  s ≤ Θ(ϵ−1 ln(1/δ)). Observe that we can set  C2 = 1C4v,1and using the fact that  Cv,1 ≤ Cv,2, we obtain that

image

Thus, this is lower bounded by 1 when  C2 = 1C4v,1.

First, we analyze the case of  v = f1. We show that CC1/32 p2v2s ln2(pmv2/s) ≤ 0.1ϵ. Observe that

image

where the last inequality uses the fact that  Cv,1 ≤ 0.1C3/2.

image

bound if  ln(pmv4/s2) ≥ 2. First, we show this is an increasing function of v. Let  w = pmv4/s2. We see that

√wln w. We observe that this is an increasing function of w as long as  w ≥ e2, which is exactly

image

if  ln(pmv2/s) ≥ 1. First, we show that  f(v) ≤ 2f(v′)if  v ≤ v′. Let  w = pmv2/s. We see that p2v2s ln(pmv2/s) =

image

this is bounded by at most a factor of 2 above any other w value.

image

that  ∥R(x1, . . . , xn)∥q ≤ 0.25ϵ.

image

Otherwise, we know that  ln(pmv4/s2) > 2. First let’s show that that  C pv2s ln(pmv4/s2) ≤ 0.25ϵ. We know that v ≤ f2. At  v = f2, we know that the expression is upper bounded by  0.25ϵ. Since the pv2s ln(pmv4/s2)term is an increasing function of v in this regime, this means that we get a bound of  0.25ϵin this case too. Thus, we know that:

image

Now, suppose that  v = f2. We’ve already shown that  ln(pmv4/s2) ≥ 2here (near the beginning of the proof). Since  v ≤ f1, we obtain a bound of  2 · 0.1ϵ = 0.2ϵ. This means:

image

H.2 Proof of Lemma 3.2

Now, we show how Lemma 2.3 and Lemma 2.4 imply Lemma 3.2. The proof simply involves bounding and simplifying the functions in the original lemmas at the target v value.

Proof of Lemma 3.2. We use Lemma 2.3 but put in an absolute constant. Let  D2 > 0be such that: if semv2 ≥ q, then

image

Otherwise, if  q3mv4 ≥ s2, then  ∥R(x1, . . . , xn)∥qis upper bounded by:

image

We use Lemma 2.4 but put in an absolute constant  D1 > 0(which we take to be  ≤ 1). Let  2 ≤ q ≤ mbe an even integer, and suppose that  0 < v ≤ 0.5and 1v2is an even integer. If  qv2 ≤ s, then

image

If m  ≥q,  2 ≤ ln(qmv4/s2) ≤q,  2qv2 ≤ 0.5s ln(qmv4/s2), and s  ≤m/e then:

image

First, we handle the case where  m ≤ Θ(ϵ−2 ln(1/δ)). Let’s take  v = ψfor any sufficiently small  ψ. By sufficiently small, we mean  v2 ≤ se2mqand  0 < v ≤ 0.5. This implies that semv2 ≥ 2qand  qv2 ≤ s. Thus we know

(using that q  ≤m) that  ∥R(x1, . . . , xn)∥2q ≤ D2

image

as desired. Suppose that  m ≤ Θ(ϵ−2 ln(1/δ)). Based on the setting q, this means that  ∥R(v, . . . , v, 0 . . . , 0)∥q ≥

image

us to assume that  s ≤ Θ(ϵ−1 ln(1/δ))and  m ≤ ϵ−2ep. Let  f1 = 4√ϵsln( mϵq )qand let  f2 = √ϵs

consider  v = Cv,1f1 =: v1and  v = Cv,2f2 =: v2. First, we handle the condition of  q3mv4 ≥ s2. We enforce the condition  Cv,1, Cv,2 ≥ 1. Assuming that  v ≥√ϵsq(which is true at the two values of v that we consider), we

know q3mv4s2 ≥ mϵ2q ≥ 1. Also, we make  m ≥ 2C2ϵ−2q, so that�

Consider  v = v2. We first check that the conditions for the upper bound are satisfied. We have that

image

ln(qmv4/s2) ≥ 2. Also, we have that qmv2se = √qm�

needed for the lower bound. Observe that

image

as desired. We check that  ln(qmv4/s2) ≤ q. It suffices to show that

image

Using the condition that  m ≤ ϵ−2 eqqC4v,2where we obtain that

image

as desired. Now, we compute the value of qv2s ln(qmv4/s2)at  v = Cv,2f2. We obtain:

image

that qmv2s = 16C2v,1mϵq ln2( mϵq ). Observe that when  Cv,1 ≥ 1and  m ≥ e2ϵ−2q ≥ e2ϵ−1q, this is lower bounded by  e2, so  ln(qmv2/s) ≥ 2. Now, we claim that when  f1 ≤ f2, we show that  ln(qmv2/s) ≤ q/2. In this case, using that  m ≤ ϵ−2qeq, we have: 4 ln(mϵ/q)q ≤

Observe that

image

At this value, observe that:

image

Let  C = D1. Let’s set�

C. Using the fact that  v2 ≤ 0.5(so 1v2 ≥ 2), this means that 1v2has can take on at least 3 different powers of 2. Let’s observe that when  16C2v ln2( mϵq ) ≤ mϵq(we can get this condition by saying that  m ≥ CM,2ϵ−2qfor a sufficiently large  CM,2) and  16C2v ln2(mϵ/q) ≥ 1(we can get this condition by saying that  m ≥ CM,2ϵ−2qfor a sufficiently large  CM,2), we know that

image

Suppose that  C4v ln2(mϵ2/q) ≤ mϵ2q(we can get this condition by saying that  m ≥ CM,2ϵ−2qfor a sufficiently large  CM,2) and  C4v ln2(mϵ2/q) ≥ 1(we can get this condition by saying that  m ≥ CM,2ϵ−2qfor a sufficiently large  CM,2). Let’s observe that

image

Let  m′ = s · eCϵ−1q1024s. When  m ≥ m′, we know that qs ln(m/s) ≤ 1024ϵCand when  m ≤ m′, we know that qs ln(m/s) ≥ 1024ϵC.

In order to plug in  v = v1and use the q2v2s ln2(qmv2/s)lower bound, we need to show that  v ≤√ln(m/s)√q. At

image

ln(qmv2/s) ≥ 2. At this value, observe that:

image

as long as  w ≥ e2. Thus, it suffices to show that q2v21s ln2(qmv21/s) ≤ q2v2s ln2(qmv2/s). When  m ≤ m′, we know that

image

For the upper bound, we see that  ln(2qmv4/s2) > ln(qmv4/s2) ≥ 2and�

image

Now, we use the fact that  v ≤ Cvf1 := v1to see that:

image

We also observe that since  2qmv4/s ≤ (qmv4/s)2, we know:

image

This, coupled with the guarantee on√2q√m, implies we have an upper bound of:

image

Moreover, we have that

image

The next case is  f1 ≤ f2and  m ≤ m′. We set  v = v1. Since  f1 ≤ f2, we know that  ln(qmv4/s2) ≤ q. Thus we know:

image

For the upper bound, we know that:

image

To make these bounds compatible, we need to handle the case where  ln(qmv4/s) ≥ 2, qv2 ≥ sbetter. Let v′ = Cvf2. Assuming that  ln(qmv4/s) ≥ 2, we know that 8192qv2s ln(2qmv4/s)can be upper bounded by:

image

as long as  ln2(mϵ/q)C4v ≥ 1(which we can make true by appropriately setting the constants on the bound for m). Observe also that:

image

This, coupled with the guarantee on√2q√m, implies that our upper bound becomes:

image

∥R(v, . . . , v, 0, . . . ,  0)∥2q ≤

image

We now show that we can tweak  Cvwithin the factor of  21/4range permitted to show that we can ensure that it is not true that  2 − ln 2 < ln(qmv4/s) ≤ 2. Observe that multiplying by a factor of  21/4in this case yields ln(2qmv4/s) > 2and dividing by a factor of  21/4yields  ln(qmv4/s) ≤ 2 − ln 2. Thus, at least one of the  Cvvalues that yields a power of 2 for 1v2will work. Thus, we have that

image

Moreover, we have that:

image

The next case is that  m > m′. We set  v = Cv√ϵs

image

For the upper bound, we see that  ln(2qmv4/s2) > ln(qmv4/s2) > 2. We know:

image

This can be relaxed to:

image

Now, we know that

image

This coupled with what we know about√2q√mmeans that:

image

Moreover, we have that

image

We use the condition on q not being more than a constant factor away from  p = ln(1/δ), to conclude that

image

move within the  Θnotation as well.

All of the experiments (in Section 4 and in this section) were run on the default hardware on a Google Colab notebook. The code is available at https://github.com/mjagadeesan/sparsejl-featurehashing.

First, we give the results of additional experimental results on real-world and synthetic datasets, using the same experimental setup as Section 4.

image

Figure 5: Phase transitions of  ˆv(m, 0.2, 0.01, s)

Figure 6: Phase transitions of  ˆv(m, 0.02, 0.05, s)

For the synthetic datasets, the trends in Figure 5 and Figure 6 look quite similar to the figures in Section 4. We see, though, that Figure 6 experiences more severe non-monotonic behavior as a function of s in the second phase transition. Consider, for example, in Figure 6, the behavior at m = 12000: we see that ˆv(m, ϵ, δ, 4) < ˆv(m, ϵ, δ, 3). In fact, the order of the phase transitions in Figure 6 is far from decreasing. Nonetheless, the general patterns and trends in the theoretical result still hold (e.g. the “flat” part occurs at a lower y-coordinate for lower s values.)

For the real-world datasets, the trends in Figure 7, Figure 8, and Figure 9 look quite similar to the figures in Section 4. One slight difference is that the failure probability noticeably increases in Figure 7 and Figure

image

Figure 7: ˆδ(m, s, 0.1)on News20

Figure 8: ˆδ(m, s, 0.1)on Enron

image

Figure 9: ˆδ(m, s, 0.03)on News20

8 between s = 8 and s = 16. It turns out that the failure probability actually increases to a local maximum somewhere in  12 ≤ s ≤ 16, and then decreases when  s ≥ 16, reaching lower than the value at s = 8 by the time s = 20. There turns out to be a similar local maximum phenomenon when  ϵ = 0.07and m = 500, though the local maximum occurs in  24 ≤ s ≤ 32and thus is not as visible in the graph.

As a general comment on non-monotonicity as a function of s, we emphasize that our asymptotic theoretical results characterize the macroscopic behavior of  v(m, ϵ, δ, s), and do not preclude the existence of constant factor fluctuations for small changes in parameters. An interesting direction for future work would be to look further into this non-mononocity and try to characterize when it arises.


Designed for Accessibility and to further Open Science