Feature hashing and other random projection schemes are influential in helping manage large data [11]. The goal is to reduce the dimensionality of feature vectors: more specifically, to project high-dimensional feature vectors living in into a lower dimensional space
(where
), while approximately preserving Euclidean distances (i.e.
distances) with high probability. This dimensionality reduction enables a classifier to process vectors in
, instead of vectors in
. In this context, feature hashing was first introduced by Weinberger et. al [29] for document-based classification tasks such as email spam filtering. For such tasks, feature hashing yields a lower dimensional embedding of a high-dimensional feature vector derived from a bag-of-words model. Since then, feature hashing has become a mainstream approach [28], applied to numerous domains including ranking text documents [4], compressing neural networks [7], and protein sequence classification [5].
Random Projections
Dimensionality reduction schemes for feature vectors fit nicely into the random projection literature. In fact, the feature hashing scheme proposed by Weinberger et al. [29] boils down to uniformly drawing a random matrix where each column contains one nonzero entry, equal to
or 1.
The -norm-preserving objective can be expressed mathematically as follows: for error
and failure probability
, the goal is to construct a probability distribution A over
real matrices that satisfies the following condition for vectors
:
The result underlying the random projection literature is the Johnson-Lindenstrauss lemma, which gives an upper bound on the dimension m achievable by a probability distribution A satisfying (1):
Lemma 1.1 (Johnson-Lindenstrauss [16]) For any and
, there exists a probability distribution A over
matrices, with
, that satisfies (1).
The optimality of the dimension m achieved by Lemma 1.1 has been proven [17, 15].
To speed up projection time, it is useful to consider probability distributions over sparse matrices (i.e. matrices with a small number of nonzero entries per column). More specifically, for matrices with s nonzero entries per column, the projection time for a vector x goes down from to
, where
is the number of nonzero entries of x. In this context, Kane and Nelson [19] constructed sparse JL distributions (which we define formally in Section 1.1), improving upon previous work [2, 22, 12]. Roughly speaking, a sparse JL distribution, as constructed in [19], boils down to drawing a random
matrix where each column contains exactly s nonzero entries, each equal to
or
. Kane and Nelson show that sparse JL distributions achieve the same (optimal) dimension as Lemma 1.1, while also satisfying a sparsity property.
Theorem 1.2 (Sparse JL [19]) For any and
, a sparse JL distribution
(defined formally in Section 1.1) over
matrices, with dimension
and sparsity
, satisfies (1).
Sparse JL distributions are state-of-the-art sparse random projections, and achieve a sparsity that is nearly optimal when the dimension m is However, in practice, it can be necessary to utilize a lower sparsity s, since the projection time is linear in s. Resolving this issue, Cohen [8] extended the upper bound in Theorem 1.2 to show that sparse JL distributions can achieve a lower sparsity with an appropriate gain in dimension. He proved the following dimension-sparsity tradeoffs:
Theorem 1.3 (Dimension-Sparsity Tradeoffs [8]) For any and
, a uniform sparse JL distribution
(defined formally in Section 1.1), with
and
satisfies (1).
Connection to Feature Hashing
Sparse JL distributions have particularly close ties to feature hashing. In particular, the feature hashing scheme proposed by Weinberger et al. [29] can be viewed as a special case of sparse JL, namely with s = 1. Interestingly, in practice, feature hashing can do much better than theoretical results, such as Theorem 1.2 and Theorem 1.3, would indicate [13]. An explanation for this phenomenon is that the highest error terms in sparse JL stem from vectors with mass concentrated on a very small number of entries, while in practice, the mass on feature vectors may be spread out between many coordinates. This motivates studying the tradeoff space for vectors with low -to-
ratio.
More formally, take to be
, so that
and
for
. Let
be the supremum over all
such that a sparse JL distribution with sparsity s and dimension m satisfies (1) for each
. (That is,
is the maximum
such that for every
, if
then (1) holds.
) For s = 1, a line of work [29, 12, 18, 10, 19] improved bounds on
, and was recently closed by Freksen et al. [13].
Theorem 1.4 ([13]) For any and
, the function
is equal to
where:
Generalizing to Sparse Random Projections with s > 1
While Theorem 1.4 is restricted to the case of s = 1, dimensionality reduction schemes constructed using sparse random projections with sparsity s > 1 have been used in practice for projecting feature vectors. For example, sparse JL-like methods (with s > 1) have been used to project feature vectors in machine learning domains including visual tracking [27], face recognition [23], and recently in ELM [6]. Now, a variant of sparse JL is included in the Python sklearn library.
In this context, it is natural to explore how constructions with s > 1 perform on feature vectors, by studying for sparse JL with s > 1. In fact, a related question was considered by Weinberger et al. [29] for “multiple hashing,” an alternate distribution over sparse matrices constructed by adding s draws from
and scaling by
. More specifically, they show that
for multiple hashing. However, Kane and Nelson [19] later showed that multiple hashing has worse geometry-preserving properties than sparse JL: that is, multiple hashing requires a larger sparsity than sparse JL to satisfy (1).
Characterizing for sparse JL distributions, which are state-of-the-art, remained an open problem. In this work, we settle how
behaves for sparse JL with a general sparsity s > 1, giving tight bounds. Our theoretical result shows that sparse JL with s > 1, even if s is a small constant, can achieve significantly better norm-preservation properties for feature vectors than sparse JL with s = 1. Moreover, we empirically demonstrate this finding.
Main Results
We show the following tight bounds on for a general sparsity s:
Theorem 1.5 For any such that
, consider a uniform sparse JL distribution (defined in Section 1.1) with sparsity s and dimension
If
and
are small enough
, the function
is equal
Our main result, Theorem 1.5, significantly generalizes Theorem 1.2, Theorem 1.3, and Theorem 1.4. Notice our bound in Theorem 1.5 has up to four regimes. In the first regime, which occurs when , Theorem 1.5 shows
, so (1) holds on the full space
. Notice this boundary on m occurs at the dimensionality-sparsity tradeoff in Theorem 1.3. In the last regime, which occurs when
, Theorem 1.5 shows that
, so there are vectors with arbitrarily small
-to-
norm ratio where (1) does not hold. When
, Theorem 1.5 shows that up to two intermediate regimes exist. One of the regimes,
, matches the middle regime of
in Theorem 1.4 with an extra factor of
, much like the bound for multiple hashing in [29] that we mentioned previously. However, unlike the multiple hashing bound, Theorem 1.5 sometimes has another regime,
, which does not arise for s = 1 (i.e. in Theorem 1.4).
Intuitively, we expect this additional regime for sparse JL with s close to
: at
and
, Theorem 1.2 tells us
, but if
is a constant, then the branch
yields
, while the branch
yields
. Thus, it is natural that the first branch disappears for large m.
Our result elucidates that increases approximately as
, thus providing insight into how even small constant increases in sparsity can be useful in practice. Another consequence of our result is a lower bound on dimension-sparsity tradeoffs (Corollary A.1 in Appendix A) that essentially matches the upper bound in Theorem 1.3. Moreover, we require new techniques to prove Theorem 1.5, for reasons that we discuss further in Section 1.2.
We also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with real-world datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s = {4, 8, 16} consistently outperforms s = 1 in preserving the norm of each vector, and that there can be up to a factor of ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. More specifically, we empirically show that
is not smooth, and that the middle regime(s) of
increases with s.
1.1 Preliminaries
Let be a sparse JL distribution if the entries of a matrix
are generated as follows. Let
where
and
are defined as follows:
• The families and
are independent from each other.
• The variables are i.i.d Rademachers (
coin flips).
• The variables are identically distributed Bernoullis ({0, 1} random variables) with expectation s/m.
• The are independent across columns but not independent within each column. For every column
, it holds that
. Moreover, the random variables are negatively correlated: for every subset
and every column
, it holds that
.
A common special case is a uniform sparse JL distribution, generated as follows: for every , we uniformly choose exactly s of these variables in
to be 1. When s = 1, every sparse JL distribution is a uniform sparse JL distribution, but for s > 1, this is not the case.
Another common special case is a block sparse JL distribution. This produces a different construction for s > 1. In this distribution, each column is partitioned into s blocks of
consecutive rows. In each block in each column, the distribution of the variables
is defined by uniformly choosing exactly one of these variables to be
1.2 Proof Techniques
We use the following notation. For any random variable X and value , we call
the qth moment of X, where E denotes the expectation. We use
to denote the q-norm
.
For every such that
, we need to analyze tail bounds of an error term, which for the sparse JL construction is the following random variable:
An upper bound on the tail probability of is needed to prove the lower bound on
in Theorem 1.5, and a lower bound is needed to prove the upper bound on
in Theorem 1.5. It turns out that it suffices to tightly analyze the random variable moments
. For the upper bound, we use Markov’s inequality like in [13, 19, 3, 24], and for the lower bound, we use the Paley-Zygmund inequality like in [13]: Markov’s inequality gives a tail upper bound from upper bounds on moments, and the Paley-Zygmund inequality gives a tail lower bound from upper and lower bounds on moments. Thus, the key ingredient of our analysis is a tight bound for
on
threshold v value.
While the moments of have been studied in previous analyses of sparse JL, we emphasize that it is not clear how to adapt these existing approaches to obtain a tight bound on every
. The moment bound that we require and obtain is far more general: the bounds in [19, 9] are limited to
and the bound in [13] is limited to
The non-combinatorial approach in [9] for bounding
on
also turns out to not be sufficiently precise on
, for reasons we discuss in Section 2.
Thus, we require new tools for our moment bound. Our analysis provides a new perspective, inspired by the probability theory literature, that differs from the existing approaches in the JL literature. We believe our style of analysis is less brittle than combinatorial approaches [13, 19, 3, 24]: in this setting, once the sparsity s = 1 case is recovered, it becomes straightforward to generalize to other s values. Moreover, our approach can yield greater precision than the existing non-combinatorial approaches [9, 8, 14], which is necessary for this setting. Thus, we believe that our structural approach to analyzing JL distributions could be of use in other settings.
In Section 2, we present an overview of our methods and the key technical lemmas to analyze . We defer the proofs to the Appendix. In Section 3, we prove the tail bounds in Theorem 1.5 from these moment bounds. In Section 4, we empirically evaluate sparse JL.
Our approach takes advantage of the structure of as a quadratic form of Rademachers (i.e.
) with random variable coefficients (i.e. where
is itself a random variable). For the upper bound, we need to analyze
for general vectors
. For the lower bound, we only need to show
is large for single vector in each
, and we show we can select the vector in the
-unit ball with
nonzero entries, all equal to v. For ease of notation, we denote this vector by [v, . . . , v, 0, . . . , 0] for the remainder of the paper.
We analyze using general moment bounds for Rademacher linear and quadratic forms. Though Cohen, Jayram, and Nelson [9] also view
as a quadratic form, we show in Appendix B that their approach of bounding the Rademachers by gaussians is not sufficiently precise for our setting.
In our approach, we make use of stronger moment bounds for Rademacher linear and quadratic forms, some of which are known to the probability theory community through Latała’s work in [21, 20] and some of which are new adaptions tailored to the constraints arising in our setting. More specifically, Latała’s bounds [21, 20] target the setting where the coefficients are scalars. In our setting, however, the coefficients are themselves random variables, and we need bounds that are tractable to analyze in this setting, which involves creating new bounds to handle some cases.
Our strategy for bounding is to break down into rows. We define
so that . We analyze the moments of
, and then combine these bounds to obtain moment bounds for
. In our bounds, we use the notation
(resp.
) to denote
(resp.
) for some constant C > 0.
2.1 Bounding ∥Zr(x1, . . . , xn)∥q
We show the following bounds on . For the lower bound, as we discussed before, it suffices to bound
. For the upper bound, we need to bound
for general vectors as a function of the
-to-
norm ratio.
Lemma 2.1 Let be a sparse JL distribution such that
. Suppose that
satisfies
and
. If T is even, then:
Lemma 2.2 Let be a sparse JL distribution. Suppose
and T are even integers. Then,
. Moreover, if
and
, then
and
We now sketch our methods to prove Lemma 2.1 and Lemma 2.2. For the lower bound (Lemma 2.2), we can view as a quadratic form
where
is an ap- propriately defined block-diagonal mn dimensional matrix. We can write
as
: for
values, the coefficients are scalars. We make use of Latała’s tight bound on Rademacher quadratic forms with scalar coefficients [21] to analyze
as a function of the
. Then, we handle the randomness of the
by taking an expectation of the resulting bound on
over the
values to obtain a bound on
.
For the upper bound (Lemma 2.1), since Latała’s bound [21] is tight for scalar quadratic forms, the natural approach would be to use it to upper bound for general vectors. However, when the vector is not of the form [v, . . . , v, 0, . . . , 0], the asymmetry makes the resulting bound intractable to simplify. Specifically, there is a term, which can be viewed as a generalization of an operator norm to an
ball cut out by
hyperplanes, that becomes problematic when taking an expectation over the
to obtain a bound on
. Thus, we construct simpler estimates that avoid these complications while remaining sufficiently precise for our setting. These estimates take advantage of the structure of
and enable us to show Lemma 2.1.
2.2 Obtaining bounds on ∥R(x1, . . . , xn)∥q
Now, we use Lemma 2.1 and Lemma 2.2 to show the following bounds on :
Lemma 2.3 Suppose is a sparse JL distribution such that
, and let
be such that
. Then,
. If
, then
. If
and if there exists a constant
such that
, then
where g is:
Lemma 2.4 Suppose is a uniform sparse JL distribution. Let q be a power of 2, and suppose that
and
is an even integer. If
, then
. If
, and
, then
. If
We now sketch how to prove bounds on using bounds on
. To show Lemma 2.3, we show that making the row terms
independent does not decrease
, and then we apply a general result from [20] for moments of sums of i.i.d symmetric random variables. For Lemma 2.4, handling the correlations between the row terms
requires more care. We show that the negative correlations induced by having exactly s nonzero entries per column do not lead to significant loss, and then stitch together
using the moments of
that contribute the most.
We now sketch how to prove Theorem 1.5, using Lemma 2.3 and Lemma 2.4. First, we simplify these bounds at the target parameters to obtain the following:
Lemma 3.1 Let be a sparse JL distribution, and suppose
and
are small enough,
,
, and
is even. If
satisfies
and
, then
.
Lemma 3.2 There is a universal constant D satisfying the following property. Let be a uniform sparse JL distribution, and suppose
are small enough,
, and q is an even integer such that
. For each
, there exists
, such that
and
.
Now, we use Lemma 3.1 and Lemma 3.2 to prove Theorem 1.5.
Proof of Theorem 1.5. Since the maps in are linear, it suffices to consider unit vectors x. First, we prove the lower bound on
. To handle
, we take q = 2 in Lemma 3.1 and apply Chebyshev’s inequality. Otherwise, we take
(approximately) and apply Lemma 3.1 and Markov’s inequality. We see that
can be expressed as:
Thus, condition (1) is satisfied for when
as desired.
Now, we prove the upper bound on . We need to lower bound the tail probability of R(v, . . . , v, 0, . . . , 0), and to do this, we use the Paley-Zygmund inequality applied to qth moments. Let D be defined as in Lemma 3.2, and take
. By the Paley-Zygmund inequality and Lemma 3.2, there exists
such that:
Thus, it follows that as desired.
Recall that for sparse JL distributions with sparsity s, the projection time for an input vector x is , where
is the number of nonzero entries in x. Since this grows linearly in s, in order to minimize the impact on projection time, we restrict to small constant s values (i.e.
). In Section 4.1, we demonstrate on real-world data the benefits of using s > 1. In Section 4.2, we illustrate trends in our theoretical bounds on synthetic data. Additional graphs can be found in Appendix I. For all experiments, we use a block sparse JL distribution to demonstrate that our theoretical upper bounds also empirically generalize to non-uniform sparse JL distributions.
4.1 Real-World Datasets
We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron).Both datasets were pre-processed with the standard tf-idf preprocessing. In this experiment, we evaluated how well sparse JL preserves the
norms of the vectors in the dataset. An interesting direction for future work would be to empirically evaluate how well sparse JL preserves other aspects of the geometry of real-world data sets, such as the
distances between pairs of vectors.
In our experiment, we estimated the failure probability for each dataset as follows. Let D be the number of vectors in the dataset, and let n be the dimension (n = 101631, D = 11314 for News20; n = 28102, D = 39861 for Enron). We drew a matrix
from a block sparse JL distribution. Then, we computed
for each vector x in the dataset, and used these values to compute an estimate
. We ran 100 trials to produce 100 estimates
.
Figure 1: News20: v. s
Figure 2: Enron: vs. s
Figure 1 and Figure 2 show the mean and error bars (3 standard errors of the mean) of at
. We consider
, and choose m values so that
.
All of the plots show that achieves a lower failure probability than s = 1, with the differences most pronounced when m is larger. In fact, at m = 1000, there is a factor of four decrease in
between s = 1 and s = 4, and a factor of ten decrease between s = 1 and s = 8, 16. We note that in plots in the Appendix, there is a slight increase between s = 8 and s = 16 at some
values (see Appendix I for a discussion of this non-monotonicity in s); however s > 1 still consistently beats s = 1. Thus, these findings demonstrate the potential benefits of using small constants s > 1 in sparse JL in practice, which aligns with our theoretical results.
4.2 Synthetic Datasets
We used synthetic data to illustrate the phase transitions in our bounds on in Theorem 1.5 for a block sparse JL distribution. For several choices of
, we computed an estimate
of
as follows. Our experiment borrowed aspects of the experimental design in [13]. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1}). The binary vectors were defined by a set W of values exponentially spread between 0.03 and
: for each
, we constructed a binary vector
where the first
entries are nonzero, and computed an estimate
of the failure probability of the block sparse JL distribution on the specific vector
(i.e.
). We computed each
using 100,000 samples from a block sparse JL distribution, as follows. In each sample, we independently drew a matrix
and computed the ratio
. Then, we took
number of samples where
. Finally, we used the estimates
to obtain the estimate
for all
where
.
Figure 3: Phase transitions of
Figure 4: Phase transitions of
Figure 3 and Figure 4 show as a function of dimension m for
for two settings of
and
. The error-bars are based on the distance to the next highest v value in W.
Our first observation is that for each set of values considered, the curve
has “sharp” changes as a function of m. More specifically,
is 0 at small m, then there is a phase transition to a nonzero value, then an increase to a higher value, then an interval where the value appears “flat”, and lastly a second phase transition to 1. The first phase transition is shared between s values, but the second phase transition occurs at different dimensions m (but is within a factor of 3 between s values). Here, the first phase transition likely corresponds to
and the second phase transition likely corresponds to
.
Our second observation is that as s increases, the “flat” part occurs at a higher y-coordinate. Here, the increase in the “flat” y-coordinate as a function of s corresponds to the term in
. Technically, according to Theorem 1.5, the “flat” parts should be increasing in m at a slow rate: the empirical “flatness” likely arises since W is a finite set in the experiments.
Our third observation is that s > 1 generally outperforms s = 1 as Theorem 1.5 suggests: that is, s > 1 generally attains a higher value than s = 1. We note at large m values (where
is close to 1), lower s settings sometimes attain a higher
than higher s settings (e.g. the second phase transition doesn’t quite occur in decreasing order of s in Figure 3): see Appendix I for a discussion of this non-monotonicity in
Nonetheless, in practice, it’s unlikely to select such a large dimension m, since the
-to-
guarantees of smaller m are likely sufficient. Hence, a greater sparsity generally leads to a better
value, thus aligning with our theoretical findings.
[1] The 20 newsgroups text dataset. https://scikit-learn.org/0.19/datasets/twenty_newsgroups. html.
[2] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.
[3] Z. Allen-Zhu, R. Gelashvili, S. Micali, and N. Shavit. Sparse sign-consistent Johnson–Lindenstrauss matrices: Compression with neuroscience-based constraints. In Proceedings of the National Academy of Sciences (PNAS), volume 111, pages 16872–16876, 2014.
[4] Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13(3):291–314, Jun 2010.
[5] C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1), 2012.
[6] C. Chen, C. Vong, C. Wong, W. Wang, and P. Wong. Efficient extreme learning machine via very sparse random projection. Soft Computing, 22, 03 2018.
[7] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. Proceedings of the 32nd Annual International Conference on Machine Learning (ICML), pages 2285–2294, 2015.
[8] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 278–287, 2016.
[9] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In Proceedings of the 1st Symposium on Simplicity in Algorithms (SOSA), pages 1–9, 2018.
[10] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimen- sionality reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pages 6618–6628, 2017.
[11] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.
[12] A. Dasgupta, R. Kumar, and T. Sarlos. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.
[13] C. Freksen, L. Kamma, and K. G. Larsen. Fully understanding the hashing trick. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 5394–5404, 2018.
[14] M. Jagadeesan. Simple analysis of sparse, sign-consistent JL. In Proceedings of the 23rd International Conference and 24th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 61:1–61:20, 2019.
[15] T.S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and steaming problems with subconstant error. In ACM Transactions on Algorithms (TALG) - Special Issue on , volume 9, pages 1–26, 2013.
[16] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[17] D. M. Kane, R. Meka, and J. Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 14th International Workshop and 15th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 628–639, 2011.
[18] D. M. Kane and J. Nelson. A derandomized sparse Johnson-Lindenstrauss transform. CoRR, abs/1006.3585, 2010.
[19] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 16872–16876. ACM Press, 2012.
[20] R. Latała. Estimation of moments of sums of independent real random variables. Annals of Probability, 25(3):1502–1513, 1997.
[21] R. Latała. Tail and moment estimates for some types of chaos. Studia Mathematica, 135(1):39–53, 1999.
[22] P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 287–296, 2006.
[23] C. Ma, J. Jung, S. Kim, and S. Ko. Random projection-based partial feature extraction for robust face recognition. Neurocomputing, 149:1232 – 1244, 2015.
[24] J. Nelson and H.L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126, 2013.
[25] J. Nelson and H.L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 101–110, 2013.
[26] D. Newman. Bag of words data set. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, 2008.
[27] H. Song. Robust visual tracking via online informative feature selection. Electronics Letters, 50(25):1931– 1932, 2014.
[28] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, volume 36 of Integrated Series in Information Systems. Springer US, Boston, MA, 2016.
[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113–1120, 2009.
In Appendix A, we prove our corollary regarding dimension-sparsity tradeoffs and discuss some of the subtleties of Theorem 1.5. In Appendix B, we show that the Hanson-Wright bound is too loose to prove Theorem 1.5. In Appendix C, we state and prove useful moment bounds that we use throughout the analysis. In Appendix D, we prove our moment bounds for in Lemma 2.1 and Lemma 2.2. In Appendix E, we prove our moment bounds for
in Lemma 2.3 and Lemma 2.4. In Appendix F, we prove auxiliary lemmas needed in the proof of Lemma 2.3. In Appendix G, we prove auxiliary lemmas needed in the proof of Lemma 2.4. In Appendix H, we prove our simplified moment bounds for
in Lemma 3.1 and Lemma 3.2. In Appendix I, we provide additional experimental results on real-world and synthetic datasets as well as additional discussion.
We discuss some of the subtleties of Theorem 1.5. When , where
, we show that
, which means that the norm-preserving condition holds on the full space. This generalizes Cohen’s bound [8] to a slightly more general family of sparse JL distributions, as we discuss below. When
, we show that
. For the remaining regimes,
and
, our upper and lower bounds on
match up to constant factors.
In terms of the boundaries between regimes, we emphasize that in Theorem 1.5, the function may not be defined for certain intervals between the boundaries of regimes, since there may be different absolute constants in different boundaries. More specifically, these intervals are
,
, and
. These gaps arise because the boundaries between the regimes on our upper and lower bounds on
can have different absolute constants, so we don’t have precise control on
in these gaps. Nonetheless, the gaps only span a constant factor range on the exponent in the dimension m.
We now state the dimension-sparsity tradeoffs that follow from our bounds:
Corollary A.1 Suppose that and
are sufficiently small and
. If
is any sparse JL distribution, then
when
. If
is a uniform sparse
JL distribution, then when
, apart from a constant-factor interval
where we do not have a bound on the behavior of sparse JL.
Proof of Corollary A.1. The first statement follows from the fact the lower bound in Theorem 1.5 holds for any sparse JL distribution. For the upper bound, we also use Theorem 1.5. Let’s set , where
is the implicit constant in the upper bound. This solves to
We also have the condition that for this regime to be reached. We can obtain the max with 1 on the exponent, by using that
when
. To avoid having a gap when
, we implicitly use that our lower bound actually doesn’t have a gap between these regimes (though there may be a gap in the boundary between the lower bound and upper bound). Thus, we only have to keep the gap
where we do not have a lower bound.
Notice that the upper and lower bounds in Corollary A.1 also match up to constant factors on the exponent in the dimension m.
Though Cohen, Jayram, and Nelson [9] also view as a quadratic form, we show that their approach is not sufficiently precise for our setting. They upper bound the moments of
by the
gaussian case through considering:
where the are i.i.d standard gaussians. They use the fact Rademachers are subgaussian to conclude that
. In order to obtain upper bounds on
, they use the Hanson-Wright bound, a tight bound on moments of gaussian quadratic forms. However, we need different technical tools for two reasons.
1. First, in order to upper bound , we need to
, and thus cannot simply consider
.
2. Second, even to lower bound , using
as a upper bound for
is not sufficiently strong. Below, we give a counter-example, i.e. a vector x, where
is too large to recover a tight lower bound.
Thus, we cannot use the Hanson-Wright bound in this setting, and need to come up with a better bound on that does not implicitly replace Rademachers by gaussians. The second point is similar in flavor to the conceptual point made in [14], where a sign-consistent variant of sparse JL was analyzed using an upper bound for Rademacher quadratic forms. However, the bound in [14] also turns out to be loose in this setting and also can’t be used to obtain either a sufficiently tight upper bound or a lower bound for
.
We now show point (2): that the Hanson-Wright bound is not sufficiently strong to obtain a lower bound . We consider
, as above, where the
are i.i.d standard gaussians. We consider p equal to
rounded up to the nearest even integer, and we consider a vector of the form [v, . . . , v, 0, . . . , 0] where
is an integer and
. We show
for a certain v value, where we know it to be true that
.
Let’s consider a vector [v, . . . , v, 0, . . . , 0] where is an integer and
. We apply the Hanson-Wright bound (which is tight for gaussians) to obtain:
Let . Let
be the set of indices where
. We can set the vector to
for all
and 0 elsewhere. This gives us:
We can expand out this moment to obtain:
Since , we know that
. Moreover, as long as
, we know that
. Thus we obtain a bound of
We show that when s = 1, the bound will produce
. At this v value, we know that:
The key quadratic form bound for Rademachers that we use is:
Lemma C.1 Let T be an even integer, be independent Rademachers, and
be a
symmetric, nonnegative random matrix with zero diagonal (i.e.
) such that
is independent from
. If
, then:
where W. . .
. . . W
is a permutation of W
, . . . , W
.
We derive Lemma C.1 from Latała’s bound on Rademacher quadratic forms [21]. In fact, Latała shows moment bounds for much more general quadratic forms, but for the application to JL, we only need the following bound in the special case of Rademachers:
Lemma C.2 ([21]) Let T be an even natural number. Let be independent Rademachers and let
a symmetric matrix with zero diagonal. Then:
where Aa
and A
. . .
. . . A
is a permutation of A
, . . . , A
.
To prove Lemma C.1, we apply Lemma C.2 to the case where the are themselves random variables:
where the last line follows from the fact that the the are nonnegative, so each term is nonnegative, so the triangle inequality results in at most a factor of 2 of gain.
Now, we consider linear forms of symmetric random variables. Theoretically, moments of these forms can be derived from Theorem 2 in [20] (a tight bound on moments of weighted sums of symmetric random variables). However, reducing the tight bound to the form that we want would require some simplifications. Instead, we give a direct proof of a weaker bound that is sufficiently tight for our setting.
Proposition C.3 Suppose that is an integer. Suppose that
are i.i.d symmetric random variables and suppose that
satisfies
and
. Then, we have that
Proof of Proposition C.3. Let . Observe that
Now, we use the fact that and the condition on k to obtain that this is bounded by
We now bound moments of squares of linear forms with a zero diagonal, i.e. . This structure of random variable theoretically falls under the scope of Lemma C.1. However, as mentioned in Section 2.1, the first term of C.1, which is an operator-norm-like term for an asymmetric random matrix in this setting, becomes intractable to manage. We give an alternate (weaker) upper bound that is both tractable to analyze and sufficiently tight for our setting. Our proof of this bound is similar to our proof of Proposition C.3 presented above. Since random variables with a zero diagonal are common in the JL literature [19, 3, 24], we believe this moment bound could be of broader use.
Lemma C.4 Suppose that are i.i.d symmetric random variables and suppose that
satisfies
and
. Let T be an even natural number. Then, we have that
Proof of Lemma C.4. Let . Observe that
Now, we use the fact that and the condition on k to obtain that this is bounded by
Latała [20] gives the following nice bound on sums of i.i.d symmetric random variables that we use for combining bounds on rows in Lemma 2.3.
Lemma C.5 ([20]) Suppose that q is an even natural number. Suppose that are i.i.d symmetric random variables. Then:
We give a general lower bound on moments of certain (potentially correlated) sums of identically distributed random variables, that we use in proving Lemma 2.4.
Proposition C.6 Let be identically distributed (but not necessarily independent) random variables, such that the joint distribution is a symmetric function of
and for any integers
, it is true that
. For any natural number q and natural number T that divides q, it is true that
Proof of Proposition C.6. The proof follows from expanding and using the fact that
0 so that we can restrict to a subset of the terms. By the symmetry of the joint distribution, we know that for
, we know that
. The number of terms of the form
in
is:
This implies that
and the statement follows from taking 1/qth powers.
We prove a lemma involving the Paley-Zygmund inequality applied to pth moments, that we use implicitly in the proof of the upper bound in Theorem 1.5.
Lemma C.7 Suppose that K > 0 and Z is a nonnegative random variable, such that and
is finite. Then,
We use the Paley-Zygmund inequality, which says the following:
Lemma C.8 (Paley-Zygmund) Suppose that Z is a nonnegative random variable with finite variance. Then,
Proof of Lemma C.7. We apply Lemma C.8 to to obtain that:
If , then we know that
and then we can apply the above result.
We analyze the moments of , proving Lemma 2.2 and Lemma 2.1. Our lower bound in Lemma 2.2 holds for
as well as
(for technical reasons discussed in Appendix E). Our upper bound in Lemma 2.1 holds for
. In Appendix D.1, we prove Lemma 2.2. In Appendix D.2, we prove Lemma 2.1.
D.1 Proof of Lemma 2.2
The key ingredient of the proof is Lemma C.1 (for Rademacher quadratic forms). We can view as the following quadratic form:
where . Since the support of
is {0, 1} and due to symmetry of this random variable, it is tractable to analyze the expressions in Lemma C.1.
Proof of Lemma 2.2. First, we handle the case of T = 2:
as desired.
is equal to
where the last line follows from the fact that since and
, we know that:
Setting t = T/M, we obtain, up to constants:
We can take a derivative to obtain the two expressions in the lemma statement at the following regimes of parameters: and
. The second regime aligns with the lemma statement. Thus it suffices to show that when
straightforward calculation.
the above calculations, without taking the sum that we obtain a lower bound of
D.2 Proof of Lemma 2.1
In Section 2.1, we discussed the tractability issues with using the general quadratic form moment bound Lemma C.1 to upper bound . Thus, we require simpler bounds that are easier to analyze. Linear forms naturally arise in the upper bound since
. However, it turns out that a vanilla linear form bound (e.g. Proposition C.3) here is weak due to the loss arising from ignoring the
term. Thus, we use Lemma C.4 (our generalized bound tailored to squares of linear forms with a zero diagonal) to obtain:
Lemma D.1 If and
, then we have that:
Proof. This can be seen by simply taking in Lemma C.4.
on in Theorem 1.5. The lower bound on moments of
in Lemma 2.2 sheds light on where this loss may be arising. We see that the problematic case is when
we require a new bound for this regime. Since the vector is in
when
, we can’t hope to beat the bound of
from Lemma 2.2. We show that we can match this value:
Lemma D.2 Suppose that satisfies
and
. If
,
, then:
many smaller that are still allowed to be present. We separate out
In the quadratic form formulation of , this separation cannot be carried out, since there would
is very close to the value where
, so this approximation is essentially tight.
be cross-terms between
Proof of Lemma D.2. WLOG, assume that . Let
. We know that
For , we use the bound
terms, we take in Proposition C.3 to obtain the following upper bound
for
and
:
Based on the conditions in this lemma statement, we know that . Thus taking a derivative, we obtain that this can be upper bounded by taking
which yields:
Finally, combining Lemma D.1 and Lemma D.2 yields Lemma 2.1:
Proof of Lemma 2.1. We apply Lemma D.1 at T = 2 to directly obtain , and for
, we apply Lemma D.1 and take a derivative to obtain:
To obtain the desired bound, we also include the bound from Lemma D.2 in the middle regime.
Now, we show to move from bounds on moments of individual rows (i.e. ) to bounds on moments of
. In Appendix E.1, we obtain an upper bound on
, thus proving Lemma 2.3. In Appendix E.2, we obtain a lower bound on
, thus proving Lemma 2.4.
E.1 Proof of Lemma 2.3
Since the are negatively correlated, we can always upper bound the moments of
by the case of a sum of independent random variables when q is even
.
s , . . . , x
where the last inequality follows from Lemma C.5. Thus, it remains to analyze the sup expression. It turns out that each regime of bounds in Lemma 2.1 collapses to one value, so the different regimes in Lemma 2.1 correspond to different parts of the max expressions in Lemma 2.3. Depending on the parameters, some of these regimes may not exist, as is reflected by branches of the max expression sometimes vanishing in Lemma 2.1. We defer the computation to Appendix F.
E.2 Proof of Lemma 2.4
Moving from a lower bound on the moments of individual rows given by Lemma 2.2 to moments of R(v, . . . , v, 0, . . . , 0) is more delicate. Unlike in the upper bound, the negative correlations between random variables require some care to handle, even with the simplification that the s nonzero entries in a column are chosen uniformly at random. For example, the conditional distribution of is 0, while the marginal distribution of
has expectation s/m. One aspect that simplifies our analysis is that we know from our proof of Lemma 2.3 which moments of
are critical in the sup expression in (2). We only need to account for these particular moments in our lower bound approach. It turns out that the three critical values are q/T = 2, q/T = q, and
.
For q/T = q, where rows are isolated, we can directly obtain a bound from Lemma C.6 and Lemma 2.2 to obtain.
Lemma E.1 Suppose is a uniform sparse JL distribution. Suppose that q is even,
,
Proof. By Lemma C.6 with T = 1, we have that:
Now, we apply Lemma 2.2 to obtain the desired expression.
For q/T = 2 and , we make use of the Lemma E.2 that relates moments of products of rows to products of moments of rows by taking advantage of either s and
being sufficiently large. The method essentially uses a counting argument to show that not too many terms vanish as a result of negative correlations, and requires adding in an indicator for the number of nonzero entries in a row being 2 for some cases (which is sufficient to prove Lemma 2.4).
Lemma E.2 Suppose is a uniform sparse JL distribution. If
is an integer, q/T is an even integer,
is an even integer, and
, then:
We defer the proof to Appendix G. Now we can use Lemma C.6 coupled with Lemma E.2 and Lemma 2.2 to handle the cases of q/T = and obtain the following bounds. For q/T = 2, we obtain:
Lemma E.3 Suppose is a uniform sparse JL distribution. If q is an even integer,
, and
is an even integer, then it is true that:
Now, by Lemma 2.2, we can see that . Thus, our bound becomes:
For , we similarly obtain the following bound using Lemma C.6 coupled with Lemma E.2.
Lemma E.4 Suppose is a uniform sparse JL distribution. Suppose that q is a power of
,
is even,
, and
. Then it is true that:
Proof. Let’s let f(x) be the function that rounds x to the nearest power of 2. By the conditions, we know that . Now, we want the condition
to be satisfied. If
, then this is implied by
, which is a strictly weaker condition than the one given in the lemma statement. If
, then
and so
gives the desired condition.
We use the fact that . We apply Lemma E.2 and Lemma C.6, with
and Lemma 2.2 to see that if we have the additional condition that
, then we know that:
With these bounds, Lemma 2.4 follows.
Proof of Lemma 2.4. We combine Lemma E.1, Lemma E.3, and Lemma E.4.
First, we use Lemma C.5 and Lemma 2.1 to prove a upper bound that is not quite in the desired form for Lemma 2.3.
Lemma F.1 Let be an even integer and
and
. If
, then:
If then we have
In all other cases, we have that
The functions are defined as follows.
Proof of Lemma F.1. As we discussed in Appendix E, it suffices to bound
Our bounds on are based on Lemma 2.1. We split into cases based on the T value, and how it separates into different cases in Lemma 2.1. Let
Let’s first consider . In this case, only the
branch arises. Now, suppose that
. Suppose that
. Then we show that the
branch does not arise. It suffices to show that
for all
. Let
. It suffices to show that
for all
.
Since at
and this is an increasing function of x, we know that the condition is true. We now produce bounds
such that
, which is what we do for the remainder of the analysis.
First, we handle the term. We see that
Now, we handle the term. We obtain a bound for
. The expression becomes:
Suppose that . In this case, we have that this expression is upper bounded by
. When we plug this into the expression, we obtain
. Otherwise, if
, then this expression is upper bounded by T = 3:
We know that that because this reduces to
Now, we handle the term when
.
If , this is bounded by 1, and if
, this is bounded by
. We see that
,
Now, we handle the term. In this case, the expression becomes:
We use some function bounding arguments to come with a simpler bound for for sufficiently large v.
Lemma F.2 Assume that for some
. Then it is true that
Proof of Lemma F.2. With the assumptions that we made we know that . This implies that our expression becomes:
Let be the minimum T such that
. We just need to bound
First, we handle the second term. Let . We use that
, so
to conclude
. We see that
We see that setting Q to its maximum value achieves within a factor of e of the maximum value of . Thus, we obtain that this is upper bounded by
.
Now, we just need to handle the first term. If , then this term doesn’t exist. Let’s take a log of the expression to obtain:
The sign of the derivative is the same as:
Since , we know that
. Thus, we know that
. Since
, we know that
, so
. Thus, the derivative is negative, so the sup is attained at
, where the expression is:
Thus, to upper bound by , it suffices to show:
Now, we combine Lemma F.1 and Lemma F.2 to prove Lemma 2.3.
Proof of Lemma 2.3. First, we compute the second moment by hand:
For , we apply Lemma F.1 and Lemma F.2. We only include
when
to simplify the bound. The bound follows.
We prove Lemma E.2.
Proof of Lemma E.2. First, we show the following fact: Suppose that there are T distinguishable buckets and we want to a assign an ordered pair of 2 unequal elements in [N] to each bucket so that the total number of times that any element shows up is
. We show that the number of such assignments is at least
for some constant C. To prove this, we first consider the case where
. In this case, we have that the number of such assignments is at least:
Now, if N < 2T, then we define:
We partition 2T into blocks, each of size N, until potentially the last block, which may be smaller. We can read off ordered pairs assigned to each bucket from this formulation. Let’s assume that each block is a permutation of 1, . . . , N, and the last block is
non-equal numbers drawn from 1, . . . , N. (this satisfies the unequal ordered pair condition). Then the number of assignments is
. This is at least as big as
for some constant
.
:
We know that
where has expectation 0. In this case we have that
where Q consists of terms that contain a factor of some . Due to the independence of the Rademachers, the expectation of any term that contains a factor of
has expectation 0, which implies that:
We know that
where consists of terms that contain a factor of some
. For similar reasons, this implies that
Let’s view and
as terms in a sum. In the second expression, every term has expectation
, and there are at most
terms. In the first expression, if there are > s copies of any
value, then the expectation is 0. Otherwise, the expectation varies between
and
. By the counting argument at the beginning of the proof, we know that there are at least
terms. This implies that
as desired.
for :
where has expectation
. In this case we have that
where Q has expectation . This implies that:
where consists of terms that contain a factor of some
. For similar reasons to the above, we have that:
Let’s view and
as terms in a sum. In the second expression, every term has expectation
(the indicator can only reduce the expectation), and there are at most
terms. In the first expression, if there are > s copies of any
value, then the expectation is 0. Otherwise, the expectation varies between
and
. By the counting argument, we know that there are at least
terms. This implies that
as desired.
Recall that our proof of Theorem 1.5 requires cleaner bounds on moments of that follow simplifying the bounds in Lemma 2.3 and Lemma 2.4 at the target values of v. The proofs of these lemmas boil down to function bounding and simplification.
H.1 Proof of Lemma 3.1
First, we show how Lemma 2.3 implies Lemma 3.1. The proof involves simplifying and bounding the function at the target v value.
Proof of Lemma 3.1. We plug q = p into Lemma 2.3. We use this relaxed version of the bound: If , then
. Otherwise, if there exists
, then
Suppose that the absolute constant on the upper bounds is . Let
(we take C to be the constant on the upper bounds). Let’s take
,
First, let’s analyze . We show that
. Observe that
. Using the fact that
Now, since , this implies that
Moreover, we know that , since
. Now, we show that
. Let’s observe that
Now, we handle the case where
using that , this immediately follows from
. Otherwise, we need it to be true that
. This can be written as
. Since
, this can be written as:
m e
:
only exist if . Observe that we can set
and using the fact that
, we obtain that
Thus, this is lower bounded by 1 when .
First, we analyze the case of . We show that
. Observe that
where the last inequality uses the fact that .
bound if . First, we show this is an increasing function of v. Let
. We see that
. We observe that this is an increasing function of w as long as
, which is exactly
if . First, we show that
if
. Let
. We see that
this is bounded by at most a factor of 2 above any other w value.
that .
Otherwise, we know that . First let’s show that that
. We know that
. At
, we know that the expression is upper bounded by
. Since the
term is an increasing function of v in this regime, this means that we get a bound of
in this case too. Thus, we know that:
Now, suppose that . We’ve already shown that
here (near the beginning of the proof). Since
, we obtain a bound of
. This means:
H.2 Proof of Lemma 3.2
Now, we show how Lemma 2.3 and Lemma 2.4 imply Lemma 3.2. The proof simply involves bounding and simplifying the functions in the original lemmas at the target v value.
Proof of Lemma 3.2. We use Lemma 2.3 but put in an absolute constant. Let be such that: if
, then
Otherwise, if , then
is upper bounded by:
We use Lemma 2.4 but put in an absolute constant (which we take to be
). Let
be an even integer, and suppose that
and
is an even integer. If
, then
If m q,
qmv
q,
qmv
, and s
m/e then:
First, we handle the case where . Let’s take
for any sufficiently small
. By sufficiently small, we mean
and
. This implies that
and
. Thus we know
(using that q m) that
, . . . , x
as desired. Suppose that . Based on the setting q, this means that
us to assume that and
. Let
and let
consider and
. First, we handle the condition of
. We enforce the condition
. Assuming that
(which is true at the two values of v that we consider), we
know . Also, we make
, so that
Consider . We first check that the conditions for the upper bound are satisfied. We have that
. Also, we have that
needed for the lower bound. Observe that
as desired. We check that . It suffices to show that
Using the condition that where we obtain that
as desired. Now, we compute the value of at
. We obtain:
that . Observe that when
and
, this is lower bounded by
, so
. Now, we claim that when
, we show that
. In this case, using that
, we have:
Observe that
At this value, observe that:
Let . Let’s set
. Using the fact that
(so
), this means that
has can take on at least 3 different powers of 2. Let’s observe that when
(we can get this condition by saying that
for a sufficiently large
) and
(we can get this condition by saying that
for a sufficiently large
), we know that
Suppose that (we can get this condition by saying that
for a sufficiently large
) and
(we can get this condition by saying that
for a sufficiently large
). Let’s observe that
Let . When
, we know that
and when
, we know that
.
In order to plug in and use the
lower bound, we need to show that
. At
. At this value, observe that:
as long as . Thus, it suffices to show that
. When
, we know that
For the upper bound, we see that and
Now, we use the fact that to see that:
We also observe that since , we know:
This, coupled with the guarantee on, implies we have an upper bound of:
Moreover, we have that
The next case is and
. We set
. Since
, we know that
. Thus we know:
For the upper bound, we know that:
To make these bounds compatible, we need to handle the case where better. Let
. Assuming that
, we know that
can be upper bounded by:
as long as (which we can make true by appropriately setting the constants on the bound for m). Observe also that:
This, coupled with the guarantee on, implies that our upper bound becomes:
v, . . . , v, 0, . . . ,
We now show that we can tweak within the factor of
range permitted to show that we can ensure that it is not true that
. Observe that multiplying by a factor of
in this case yields
and dividing by a factor of
yields
. Thus, at least one of the
values that yields a power of 2 for
will work. Thus, we have that
Moreover, we have that:
The next case is that . We set
For the upper bound, we see that . We know:
This can be relaxed to:
Now, we know that
This coupled with what we know aboutmeans that:
Moreover, we have that
We use the condition on q not being more than a constant factor away from , to conclude that
move within the notation as well.
All of the experiments (in Section 4 and in this section) were run on the default hardware on a Google Colab notebook. The code is available at https://github.com/mjagadeesan/sparsejl-featurehashing.
First, we give the results of additional experimental results on real-world and synthetic datasets, using the same experimental setup as Section 4.
Figure 5: Phase transitions of
Figure 6: Phase transitions of
For the synthetic datasets, the trends in Figure 5 and Figure 6 look quite similar to the figures in Section 4. We see, though, that Figure 6 experiences more severe non-monotonic behavior as a function of s in the second phase transition. Consider, for example, in Figure 6, the behavior at m = 12000: we see that . In fact, the order of the phase transitions in Figure 6 is far from decreasing. Nonetheless, the general patterns and trends in the theoretical result still hold (e.g. the “flat” part occurs at a lower y-coordinate for lower s values.)
For the real-world datasets, the trends in Figure 7, Figure 8, and Figure 9 look quite similar to the figures in Section 4. One slight difference is that the failure probability noticeably increases in Figure 7 and Figure
Figure 7: on News20
Figure 8: on Enron
Figure 9: on News20
8 between s = 8 and s = 16. It turns out that the failure probability actually increases to a local maximum somewhere in , and then decreases when
, reaching lower than the value at s = 8 by the time s = 20. There turns out to be a similar local maximum phenomenon when
and m = 500, though the local maximum occurs in
and thus is not as visible in the graph.
As a general comment on non-monotonicity as a function of s, we emphasize that our asymptotic theoretical results characterize the macroscopic behavior of , and do not preclude the existence of constant factor fluctuations for small changes in parameters. An interesting direction for future work would be to look further into this non-mononocity and try to characterize when it arises.