Random linear mappings play a central role in dimension reduction, compressed sensing, and numerical linear algebra due to their propensity to preserve the geometry of a given set. The performance of a random linear mapping is often determined by the uniform concentration bound of
for all vectors in a set of interest (in other words, how close the map is to being an isometry on the set). This is now well-understood by the standard techniques in the Gaussian random matrix case [9,30,32]. However, in many applications, non-gaussian random mappings are more useful because of their computational/storage benefits or simply the difficulty to generate Gaussian matrices using sampling devices [17]. For example, sparse or structured random matrices are preferred in both dimension reduction [8] and random sketching in numerical linear algebra [1,14,25,34] since they provide more efficient matrix multiplications than dense and unstructured matrices such as Gaussian ones. Certain formulations in compressed sensing also naturally require random matrices such as randomly subsampled Fourier measurements [18] or Bernoulli random matrices [28].
There has been a series of recent works [8, 20, 24] to demonstrate the effectiveness of random mappings outside the Gaussian setup. Unlike the Gaussian case in which we have a rotation invariance property, non-Gaussian setups require more sophisticated arguments to address various new technical challenges. In this article, we will be focusing on sub-Gaussian random mappings.
Let us recall some definitions. For -norm (which is the Orlicz norm taken with respect to function exp(1) of a random variable X is defined as
In particular, = 2 gives the sub-Gaussian norm and = 1 gives the sub-exponential norm. The random variable X is called sub-Gaussian if and called sub-exponential if
For sub-Gaussian random variables, the -norm roughly measures how fast the tail distribution decays – usually the bigger -norm is, the heavier the tail. We will repeatedly use the fact that if and only if the tail probability ) is bounded by a Gaussian with standard deviation in the order of K. A precise statement of this, along with some other properties of -norm, can be found in Appendix A.
The sub-Gaussian norms for many random variables can be calculated by looking at the moment generating function of their squares. For example, the sub-Gaussian norm for ) it is log; for Rademacher random variable it is log
for any bounded (by M) random variable it is no more than is not a sub-Gaussian random variable, but has sub-exponential norm For a random vector is sub-Gaussian if
We say a random matrix is isotropic and sub-Gaussian if its rows are independent, isotropic and sub-Gaussian random vectors in . The sub-Gaussian parameter of A is defined as
For random matrix , the isotropic condition guarantees will preserve Euclidean norm in expectation. Some examples of isotropic and sub-Gaussian matrices are matrices whose entries are independent and sub-Gaussian with = 1, uniformly subsampled (with replace- ment and after proper normalization) rows of orthonormal basis or tight frames, etc. [32]. In the cases of Bernoulli matrices or sparse ternary matrices, which is a generalization of the database-friendly mappings in [1], the sub-Gaussian parameter can depend on the signal dimension n if the probability of an entry being nonzero is n-dependent.
In the line of research regarding sub-Gaussian random mappings, Liaw et al. [20] showed that for isotropic and sub-Gaussian mapping A with sub-Gaussian parameter have with high probability,
Here w(T) is the Gaussian width given by
which is the radius when T is symmetric.
Gaussian width measures the complexity of a set. In particular, denote cone(0) is a meaningful approximation for dimension [6,24]. Generally rad(T) is also dominated by w(T). For example, if 0 , then by Jensen’s inequality,
In such case, (1) implies that with high probability, is a near isometry on T whenever ) for some constant C.
The dependency on w(T) in (1) is optimal. This is easy to see when m = 1 and A has i.i.d. Normal(0, 1) entries. But when it comes to the dependency on the sub-Gaussian parameter K, whether the factor can be improved is a question raised but left unanswered in [20]. Other important works regarding this type of bounds are either not explicit [15,24] or at least of the same order
In this article, we refine this dependency on the sub-Gaussian parameter from to the optimal . This enhances the concentration bound substantially when the sub-Gaussian mapping is not well-behaved, for example, when K increases together with the signal dimension. We also relax the row-independent requirement by considering random mappings in the form of BA where B is an arbitrary matrix and A is mean zero, isotropic and sub-Gaussian. The mean zero assumption is additional when comparing to the assumptions for (1), and not needed when B is only diagonal. However, it is necessary for arbitrary B. Our bound is broadly applicable since it only require these properties from the random matrix A without any other assumptions.
Now we state our main theorem. In the following, denote Frobenius and operator norm of B respectively. The matrix is diagonal means that the only possible non-zero entries are
be a fixed matrix, let be a mean zero, isotropic and sub-Gaussian matrix with sub-Gaussian parameter be a bounded set. Then
Here C is an absolute constant. Furthermore, when B is a diagonal matrix, random matrix A only need to be isotropic and sub-Gaussian with sub-Gaussian parameter K for the conclusions to hold.
When B is the identity matrix, we have the following corollary.
be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter be a bounded set. Then
The bound appearing on the right hand side of Theorem 1.1 is optimal in general. The factor is optimal and this is easy to see when B has non-zero singular values being all equal (because the statement should be invariant under scaling for B). We also give another example below in which the singular values are not all equal. The dependency on K is optimal and this follows from Proposition 4.5 in Section 4.4 with T being a singleton. As mentioned before, rad(T) is generally dominated by w(T) and the dependency on w(T) is also optimal.
Assuming rad(T) is dominated by w(T), Theorem 1.1 then implies that with high probability, matrix is a near isometry on T whenever the stable rank of B
This result recovers (1) with improved dependency on
has i.i.d. entries where
It is easy to verify that has isotropic rows . Moreover, for any , notice that
On the other hand, . So in this case, Theorem 1.1 does not hold when m is sufficiently large.
As an example demonstrating is optimal in general, consider the case when A is standard gaussian so that Also let be the coordinates of g, then
where we used Jensen’s inequality in the second last line. This estimate is in the order of when with some constant C large enough.
isotropic and sub-Gaussian conditions of A guarantee that K is bounded below from 1. To see this, let has independent coordinates satisfying
we can conclude that and the equality is achieved when
The proof for Theorem 1.1 follows an analogous approach in Liaw et al. [20]. One major difference is that we prove and apply two new concentration inequalities with improved parametric dependency in the sub-Gaussian regime. We believe these inequalities are interesting on their own as an application-oriented concentration inequality.
The first one is a new Bernstein type inequality under bounded first absolute moment condition. This inequality provides a concentration bound for sum of sub-exponential random variables.
Theorem 1.3 (New Bernstein’s Inequality)be a fixed non-zero vector and let be independent, mean zero sub-exponential random variables satisfying . Then for every
Remark 1.4. Theorem 1.3 remains true (with a different absolute constant c) when the 2 in is replaced with an arbitrary positive constant (see Remark 2.2 for more detail).
The second one is a new Hanson-Wright inequality under unit variance condition. This inequality provides a concentration bound for quadratic forms of independent random variables and is more general than the aforementioned Bernstein’s inequality. In the literature, results of similar flavor have been obtained [3,16,33] but under different assumptions. We will give a brief comparison between our result and a few notable ones in Section 3.
Theorem 1.5 (New Hanson-Wright Inequality)be a fixed non-zero matrix and let be a random vector with independent, mean zero, sub-Gaussian coordinates satisfying . Then for every
Remark 1.6. If A is a diagonal matrix, then Theorem 1.5 recovers Theorem 1.3 (assuming all are equal) with . Therefore this can be viewed as a generalization of the new Bernstein’s inequality given in Theorem 1.3.
We use for Euclidean norm of vectors, for Frobenius and operator norm of matrices respectively. We use for Hadamard (entrywise) product. We say some absolute constant for some absolute constant C. Typically, c and C denote absolute constants (often c for small ones and C for large ones) which may vary from line to line.
The rest of this paper is organized as follows: In Section 2, we discuss and prove the new Bernstein’s inequality (Theorem 1.3). In Section 3, we first discuss and compare the new Hanson-Wright inequality (Theorem 1.5) to other known variants of Hanson-Wright inequalities and then prove Theorem 1.5. In Section 4, we prove our main theorem regarding sub-Gaussian matrices on sets (Theorem 1.1) and show that our dependency on K is optimal. In Section 5, we demonstrate how our result can improve theoretical guarantees of some popular applications such as JohnsonLindenstrauss embedding, randomized sketches and blind demodulation. In Section 6, we make a brief conclusion for this paper.
In this section we prove the new Bernstein’s inequality Theorem 1.3. Let us first recall the standard Bernstein’s inequality for sub-exponential random variables [32, Theorem 2.8.2], which states that for independent, mean zero, sub-exponential random variables and a vector a = (
where
Compared to (2), Theorem 1.3 has an extra assumption on the first absolute moment of namely 2, but it improves the dependence on K in the sub-Gaussian regime from . It is worth noting that such extra assumption comes naturally when considering isotropic random matrices/vectors. In fact, let x be a fixed point on the unit sphere and let be isotropic random vectors of the same dimension, then 1 is mean zero since is isotropic, and + 1 = 2 by triangle inequality.
We will first bound the moments of , then bound their moment generating functions, and finally use Chernoff method to obtain the desired tail bound.
The idea here is to write the moment as an integral and then estimate under the two constraints
Lemma 2.1 (Moment Bounds). Let Y be a sub-exponential random variable satisfying and
2, we have
Also, since , a change of variable
For the p-th moment of |Y |, with a change of variable
We will split this integral into two parts. Set monotonically increases on [0, T], we have
On the other hand, since
Combining these two parts completes the proof with
Let Y be the random variable as in Lemma 2.1, the moment generating function of Y can be estimated through Taylor series
Here the first inequality is by Lemma 2.1 (with 6) and the second inequality uses When ), the above summation converges and we have
for absolute constants
where are absolute constants. When we minimize the above expression over the optimal value
Next we plug in into (6) to get
The bound for ) is similarly obtained by considering instead of . This com- pletes the proof.
Remark 2.2. If the random variables have first absolute moment , then the right hand side of Equation (3) becomes and it is easy to see that Lemma 2.1 still holds with follows that the in Step 2 will be no more than 6 + and Theorem 1.3 now holds with constant
Hanson-Wright inequality gives a concentration bound for quadratic forms of random variables. The version in [27] states that for a random vector with independent, mean zero, sub-Gaussian coordinates, suppose maxreal matrix, then
In the same spirit as the new Bernstein’s inequality, we can improve the tail dependency on K in the sub-Gaussian regime from under a further assumption = 1 for each This is the new Hanson-Wright inequality Theorem 1.5.
It is not difficult to drop the requirement = 1 in Theorem 1.5 by a simple scaling, in which case we have the following corollary.
be a random vector with independent, mean zero, sub-Gaussian coordinates satisfying 0 , then for fixed square matrix A,
and define diagonal matrices
Then ˜satisfies the assumption of Theorem 1.5 with . Applying Theorem 1.5 to ˜completes the proof.
Let us first compare Corollary 3.1 to the standard Hanson-Wright inequality (8) in the case when = 1. The concentration bound in (8) implies that, with probability at least 1
Meanwhile, Corollary 3.1 implies that
where . Note that , so this bound improves the parameter dependence (up to a log factor) in the sub-Gaussian regime from . Such improvement can be significant when is far less than K.
Other variants of Hanson-Wright inequality have appeared in literature with similar improvements [3,33]. In particular, one of the results by Adamczak [3] works under the assumption that X satisfies the convex concentration property with constant ˜K, that is, for every 1-Lipschitz convex function , we always have
Then under such assumption,
where Cov(X) is the covariance matrix of X. When X has independent and mean zero coordinates, However, the convex concentration property is not the same as subGaussianity. More precisely, while it is true that ˜K is independent of dimension when X has i.i.d. coordinates which are bounded almost surely [29], this can fail when the boundedness assumption of is replaced by sub-Gaussianity (i.e. ˜K could depend on the dimension of are i.i.d. and sub-Gaussian) [2,13]. Therefore the bound in (11) does not imply (10) nor (9) in general.
In a more recent paper by Klochkov and Zhivotovskiy [16], the authors proved a uniform version of the Hanson-Wright inequality which, when applying to a single matrix under the same assumption as Corollary 3.1, yields the following bound:
where . This bound also improves (9) in some cases as demonstrated in [16]. We shall compare this bound to (10) in the sub-Gaussian regime. On one hand, Jensen’s inequality tells us that the factor in (12) is bounded by the factor in (10). On the other, the factor M in (12) is only bounded by , which could depend on dimension n. Moreover, (12) only provides a one-sided bound instead of two-sided concentration bounds like Equations (9) to (11).
The main idea of proof is similar to [27], that is to divide the sum into diagonal and off-diagonal, then bound the moment generating function of the latter through a decoupling and comparison argument. However, there are two significant differences. The first difference is the random variables used for comparison. We will use scaled Bernoulli multiplied by standard Gaussian in order to preserve the condition of second moment being 1. Using such random variables also leads to challenges in bounding the moment generating function, which is the second difference.
Now we proceed with the proof. For any t > 0, let
be the the tail probability we want to bound. Let ) be the diagonal of A and let
We will seek bounds for
The bound for is given by our new Bernstein’s inequality. Notice that
where . So by Theorem 1.3 and the simple relationships between the norms of
To bound , we will derive a bound for the moment generating function of
be an independent copy of X, then
The above follows directly from the following decoupling lemma.
Lemma 3.2 (Decoupling [32])matrix, and let be a random vector with independent mean zero coordinates. Then for every convex function
See Theorem 6.1.1 and Remark 6.1.3 in [32] for a proof of Lemma 3.2.
We will compare ) to scaled Bernoulli multiplied entrywise by standard Gaussian. But first let us look at the case of a single variable through the following lemma.
Proof. Using the inequality ), which is true for all (see Appendix C), we have
By Lemma 2.1 we know that for any positive integer q and some absolute constant
On the other hand, a direct calculation gives
Choosing any completes the proof.
Now let be random vectors such that has entries be independent copies of g and r. Let be any vector in , by Lemma 3.3 and independence we have
Note the above also holds for ). Therefore
where ). Here the two inequalities are repeated applications of Equation (13).
Denote ) the singular values of matrix . From the rotation invariance of g and
For standard normal random variables
where the inequality uses (1 ) (see Appendix C).
Also note that
Next, use the following Lemma 3.4 (with ) to bound the moment generating function of and we obtain
Lemma 3.4. Let D be a diagonal random matrix with i.i.d. entries let be an independent copy of D. Given a fixed matrix A, then
. Notice that
so for
Here the second last inequality uses 1]. The last inequality uses 1 +
From previous steps we get
Optimizing this over (similar to proof of Theorem 1.3) yields a one sided bound for . The other side can then be obtained by considering ) instead of . Together they give
Lastly, since , combining the bounds for and then applying inequality min(see Appendix C) complete the proof of Theorem 1.5.
In this section we prove Theorem 1.1 and show that the dependence on K is optimal. Section 4.1 studies the simple case when T consists of only a single point. Section 4.2 establishes the technical sub-Gaussian increments lemmas and Section 4.3 proves Theorem 1.1 through these lemmas and Talagrand’s Majorizing Measure Theorem. Section 4.4 provides an example through scaled Bernoulli random variables that shows
4.1 Concentration of Random Vectors
Let . The isotropic and sub-Gaussian assumption on A now implies X has independent coordinates satisfying Lemma 5.3 in [20] states that
In other words, has a sub-Gaussian concentration around . It is worth noting that this concentration is independent of the ambient dimension m. We will follow a similar proof idea, but use the new inequalities (Theorem 1.3 and Theorem 1.5) to generalize and refine this result.
matrix and let be a random vector with independent sub-Gaussian coordinates satisfying . If either one of the following conditions further holds:
Proof. The conclusion is trivial if = 0, so we will assume B is non-zero.
Note that for
(This readily comes from the inequalities
Let
To bound this probability, we observe that
Combining these two bounds and then using property (b) in Appendix A complete the proof.
(b) We will first use Bernstein’s inequality to obtain (14). Denote the diagonal entries of
For random variables 1, notice that
where the -norm estimate is from property (f) in Appendix A. So by Theorem 1.3 and using the inequality
The rest of the proof is the same as in (a).
4.2 Sub-Gaussian Increments Lemma
A key lemma for Theorem 1.1 is to show that the random process sub-Gaussian increments. That is, and for all Theorem 1.3 in [20] showed sub-Gaussian increments for . Here we improve and generalize this result to any factor is in fact optimal as suggested by Proposition 4.5 in Section 4.4.
We will prove two versions of the sub-Gaussian increment lemma. The first one (Lemma 4.2) is for arbitrary B, but require the random matrix A to be mean zero. The second one (Lemma 4.3) is only for diagonal B, but does not require zero mean from A.
For Lemma 4.2 below, the beginning of the proof follows the argument in [20], except we will use Theorem 4.1 for better tail bounds. Later on in the proof, we will use a different approach to bound one of the tail probabilities (i.e. ) through the new Hanson-Wright inequality (Theorem 1.5).
be a fixed matrix and let be a mean zero, isotropic and sub-Gaussian matrix with sub-Gaussian parameter K. Then the random process
Proof. The statement is invariant under scaling for B. So without loss of generality, we will assume B has operator norm
Without loss of generality, assume
We need to bound this tail probability by a Gaussian whose standard deviation is the order of . Consider the following two cases:
• 0
Next we derive bounds for
From
Applying Theorem 4.1 to the random vector Au we get
Applying Theorem 4.1 to the random vector Ax and note that
Notice that
Let us also denote , then from
Thus we can further write Z as
where the second equality uses the fact that Z is mean zero and in the last equality
so by Theorem 1.5 we have
Hence for 0
Now we apply Equation (15) to
• For 5), we can conclude that
• For . Notice that
So far we have showed that
where ) for some absolute constant inequality min(see Appendix C), we get
Without loss of generality, we can assume be the projection of y onto unit ball, then by triangle inequality,
Here it is bounded by
where the first equality uses , the second equality is true since and the last inequality follows from Theorem 4.1. Combining these bounds we get
Finally, note that = 1, so by non-expansiveness of projection, definition of projection, . This completes the proof.
Next we show the second version of sub-Gaussian increment lemma, which requires B to be diagonal and does not need A to be mean zero. The proof is mostly the same as Lemma 4.2, so we will only highlight the differences.
be a fixed diagonal matrix and let be a isotropic, subGaussian matrix with sub-Gaussian parameter K, then the random process
Proof. If B is not a square matrix, we can always add rows of zeros to B (when l < m) or remove the last rows of zeros from B (when l > m). This will turn matrix without changing the values of . So without loss of generality, we can assume B is a square matrix. Also without loss of generality, we can further assume since the conclusion is invariant under scaling for B.
The remaining proof for Lemma 4.3 is the same as proof for Lemma 4.2 except for bounding Step 1. A bound for here can be obtained through the new Bernstein’s inequality (Theorem 1.3) as detailed below.
Recall that
where The random variables independent, with
Here we used 2, and that is isotropic. Furthermore, from property (d) in Appendix A we have
Since 0
4.3 Proof of Theorem 1.1
Theorem 1.1 follows form the sub-Gaussian increments lemmas and Talagrand’s Majorizing Measure Theorem. Let us first recall the Majorizing Measure Theorem. The following statement is from [20].
Theorem 4.4 (Majorizing Measure Theorem)be a random process on a bounded set . Assume that the process has sub-Gaussian increments, that is there exists that
The first part of Theorem 4.4 can be found in [31, Theorem 2.4.12] and the second part can be found in [7, Theorem 3.2].
Using Lemma 4.2 and Theorem 4.4 (Majorizing Measure Theorem), we get
Using property (e) in Appendix A and Lemma 4.2, we get
Since diam(), applying Lemma 4.2 and Theorem 4.4 we know that the event
holds with probability at least 1 Combining these yields the desired high probability bound.
Finally, when B is a diagonal matrix and A is not necessarily mean zero, we can repeat the above argument with Lemma 4.3 instead of Lemma 4.2. This completes the proof.
4.4 An Example for Lower Bound
Here we use scaled Bernoulli random variables to demonstrate that the factor in Theo- rem 1.1 is optimal in general.
be a random vector with independent coordinates such that
Note that the expected number of non-zero coordinates for essentially says X is non-zero in expectation, which is a mild assumption. For the proof of Proposition 4.5, we will need the following lower bound (see [26, Lemma 4.7.2]) about Binomial distributions.
Here ) is the Kullback-Leibler divergence between two Bernoulli distributions with parameters x and y respectively given by
Moreover, for 0 < y < x < 1,
follows directly from definition since
Let , with a change of variable
To show (16), we need to find a So by a change of variable , it suffices to show
Let
then
where . So by (17) we have
with := 9 log 9. Here the second inequality uses 21 on the interval of integration, and the last inequality holds because
Take
This proves (16) with
5.1 Johnson-Lindenstrauss Lemma
One immediate application of our result is a guarantee for all isotropic and sub-Gaussian matrices as Johnson-Lindenstrauss (JL) embeddings for dimension reduction. We state this JL lemma below. It follows directly form Theorem 4.1.
be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter
Proof. By scaling we can assume = 1. By Theorem 4.1 (with
the result then follows from property (a) in Appendix A.
on the example in Proposition 4.5, we can further show that (see Appendix B) the dependence on sub-Gaussian parameter K here is also optimal for small . Similar results to Lemma 5.1 have appeared in [8, 22], but to the best of our knowledge, the previous known dependence on K was
5.2 Randomized Sketches
Randomized sketching provide a method for approximating convex programs [25, 35]. In essence, a randomized sketch reduces the dimension of the original optimization problem through random projections, which can be beneficial in both computational time and memory storage. Following the problem formulation and ideas in [25], consider convex program in the form of
where is some convex set. Let be an isotropic and sub-Gaussian matrix and solve instead the convex program
This is called the ”sketched problem”. It reduce the dimension from n to m and can be viewed as an approximation to the original problem (20). Moreover, say a solution ˆx to the sketched problem (21) is -optimal to the original optimal solution
Pilanci and Wainwright [25] gave a high probability guarantee for ˆ-optimal when m is sufficient large. The following Theorem 5.2 improves the dependence on K in their guarantee from . The proof of Theorem 5.2 is also more concise thanks to the tools we have developed.
-optimal guarantee). Let A be an isotropic and sub-Gaussian matrix with subGaussian parameter
then a solution ˆx to the sketched problem as given in (21) is -optimal with probability at least 1 are absolute constants and T is the tangent cone of C at optimum
We will use an argument similar to [25] to prove Theorem 5.2. First let us state a deterministic result that says -optimality can be obtained by controlling two quantities.
Next we show a technical Lemma that will be helpful when estimating
Lemma 5.4. Let A be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter K, and let be a set with radius rad(, then there exists absolute constants C and c such that for any
. By Corollary 1.2 we have
holds with probability at least 1 . On this event,
where we use the estimate
Proof of Theorem 5.2. We wish to control the ratio in sight of Lemma 5.3. By Lemma 5.4, if
Let
triangle inequality gives
where . Applying Lemma 5.4 to
Combining the bounds for
with probability at least 1 . This completes the proof.
5.3 Favorable Landscape for Blind Demodulation with Generative Priors
In this section, we give a concrete example where the improvement on the sub-Gaussian parameter K can be important through blind demodulation with generative priors.
Blind demodulation aims to recover two signals from observation ◦ denotes componentwise multiplication. Due to the inherent nature of ambiguity of the solutions from , one usually assume that the signals come with some structure. A traditional way to model this structure is through a sparsity prior with respect to a basis such as wavelet or the Discrete Cosine Transform basis in case the signals are images.
On the other hand, with recent development in deep learning, the generative adversarial network (GAN) is turning out to be very effective in generating realistic synthetic images, which naturally indicates that we may model a certain type of image signals as outputs of GAN. Especially in the inverse problems like compressed sensing, phase retrieval including this blind demodulation, practitioners have observed an order of magnitude sample (observation) complexity improvement over the sparsity prior [5,12,21].
This alternative model is called the generative prior and as a consequence is becoming a new promising model for modern signal processing [5, 10–12]. In Hand and Joshi [10], the authors provide a global landscape guarantee for blind demodulation problem with generative priors and they applied our Bernstein’s inequality in their proof.
With generative priors, unknown signals are assumed to be in the range of two generative neural networks respectively. More precisely, -layer network, -layer network and they can be written as
where relu is the Rectified Linear Unit activation function given by relu(x) = max{x, 0} and are weight matrices.
The weight matrices are normally obtained in the training process of the networks but the empirical evidence in [4] suggests that they behave a “random-like” quantity . Based on this phenomenon, the authors of [10] made the following additional assumptions on the networks and to facilitate analysis further:
The signals can then be recovered by finding their latent codes ). This leads to the following empirical risk minimization program:
Note that there is a scaling ambiguity in this problem since it does not distinguish points on curve for any given (h, m), thus one can only hope to find the solution curve
. The authors in [10] showed that under assumptions A1-A3, two conditions that are called the Weight Distributed Condition (WDC) and the joint-WDC are met. These conditions guarantee a favorable landscape for the objective function f(h, m), namely f has a descent direction at all points outside of a small neighborhood of four curves containing the solution. One of the important ingredients in their proof is concentration bounds for singular values of random matrices. When they showed that the joint-WDC condition is satisfied by concentration argument, they were able to improve the requirement in assumption A3 from, up to log factors, . Such improvement is made possible by our new Bernstein’s inequality with refined sub-exponential parameter dependence. This sample complexity matches the one in the previous recovery guarantees with sparsity prior (in which case n and p denotes the sparsity levels), but potentially better since the latent code dimension is oftentimes smaller than a sparsity level with respect to a particular basis. See Theorem 2, Theorem 5, Lemma 8, and Lemma 9 in [10] for more details.
In this article, we proved the optimal concentration bound for sub-Gaussian random matrices on sets. Namely, with high probability,
where is an arbitrary matrix, is an (mean zero) isotropic and sub-Gaussian random matrix, is the set, K is the sub-Gaussian parameter of A, sr(B) is the stable rank of B, w(T) is the gaussian width of . Compared to the previous work in [20], this result generalizes by allowing an arbitrary matrix B while improves the dependency on the sub-Gaussian parameter from to the optimal . Consequently, this can lead to a tighter concentration bound even in the cases where the sub-Gaussian matrix BA have correlated rows. It is also worth noting that dependence on w(T) + rad(T) is optimal in general as well.
We also proved, under extra moment conditions, a new Bernstein type inequality and a new Hanson-Wright inequality. The extra conditions here are bounded first absolute moment (e.g. 2) for Bernstein’s inequality and bounded second moment (e.g. = 1) for Hanson- Wright inequality. In many cases, these conditions can be easily met – for example, they are implied by the isotropic condition of random variables or vectors. In general, both of our new inequalities give improved tail bounds in the sub-Gaussian regime, which is the regime of interest in many applications as demonstrated in Section 5.
Y. Plan is partially supported by an NSERC Discovery Grant (22R23068), a PIMS CRG 33: HighDimensional Data Analysis, and a Tier II Canada Research Chair in Data Science. ¨O. Yılmaz is partially supported by an NSERC Discovery Grant (22R82411) and PIMS CRG 33: HighDimensional Data Analysis. H. Jeong is funded in part by the University of British Columbia Data Science Institute (UBC DSI) and by the Pacific Institute of Mathematical Sciences (PIMS). The authors of this paper would also like to thank Babhru Joshi for the helpful discussions and providing us with an important application in Section 5.
[1] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003.
[2] R. Adamczak. Logarithmic sobolev inequalities and concentration of measure for convex functions and polynomial chaoses. Bulletin of the Polish Academy of Sciences Mathematics, 53(2):221–238, 2005.
[3] R. Adamczak. A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.
[4] S. Arora, Y. Liang, and T. Ma. Why are deep nets reversible: A simple theory, with implica- tions for training. arXiv preprint arXiv:1511.05653, 2015.
[5] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 537–546. JMLR. org, 2017.
[6] E. J. Cand`es. Mathematics of sparsity (and a few other things). In Proceedings of the International Congress of Mathematicians, Seoul, South Korea, volume 123, 2014.
[7] S. Dirksen. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.
[8] S. Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Foundations of Computational Mathematics, 16(5):1367–1396, 2016.
[9] Y. Gordon. On milman’s inequality and random subspaces which escape through a mesh in , pages 84–106. Springer, 1988.
[10] P. Hand and B. Joshi. Global guarantees for blind demodulation with generative priors. In Advances in Neural Information Processing Systems, pages 11531–11541, 2019.
[11] P. Hand, O. Leong, and V. Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9136–9146, 2018.
[12] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. In Proceedings of Machine Learning Research, volume 75, pages 1–8, 2018.
[13] P. Hitczenko, S. Kwapie´n, W. Li, G. Schechtman, T. Schlumprecht, and J. Zinn. Hypercontractivity and comparison of moments of iterated maxima and minima of independent random variables. Electronic Journal of Probability, 3, 1998.
[14] D. M. Kane and J. Nelson. Sparser johnson-lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014.
[15] B. Klartag and S. Mendelson. Empirical processes and random projections. Journal of Functional Analysis, 225(1):229, 2005.
[16] Y. Klochkov and N. Zhivotovskiy. Uniform hanson-wright type concentration inequalities for unbounded entries via the entropy method. arXiv preprint arXiv:1812.03548, 2018.
[17] F. Krahmer, D. Needell, and R. Ward. Compressive sensing with redundant dictionaries and structured measurements. SIAM Journal on Mathematical Analysis, 47(6):4606–4629, 2015.
[18] F. Krahmer and H. Rauhut. Structured random measurements in signal processing. GAMMMitteilungen, 37(2):217–238, 2014.
[19] K. G. Larsen and J. Nelson. The johnson-lindenstrauss lemma is optimal for linear dimensionality reduction. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 82:1–82:11, 2016.
[20] C. Liaw, A. Mehrabian, Y. Plan, and R. Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric Aspects of Functional Analysis, pages 277–299. Springer, 2017.
[21] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos. Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine, 35(1):20–36, 2018.
[22] J. Matouˇsek. On variants of the johnson–lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.
[23] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4):1248–1282, 2007.
[24] S. Oymak and J. A. Tropp. Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA, 7(3):337–446, 2017.
[25] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guar- antees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.
[26] B. Robert. Ash. Information Theory. Dover Publications Inc., 1990.
[27] M. Rudelson and R. Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013.
[28] R. Saab, R. Wang, and ¨O. Yılmaz. From compressed sensing to compressed bit-streams: practical encoders, tractable decoders. IEEE Transactions on Information Theory, 64(9):6098– 6114, 2018.
[29] P.-M. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing pro- cesses. The Annals of Probability, 28(1):416–461, 2000.
[30] G. Schechtman. Two observations regarding embedding subsets of euclidean spaces in normed spaces. Advances in Mathematics, 200(1):125–135, 2006.
[31] M. Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014.
[32] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
[33] V. Vu and K. Wang. Random weighted projections, random quadratic forms and random eigenvectors. Random Structures & Algorithms, 47(4):792–821, 2015.
[34] D. P. Woodruff. Sketching as a tool for numerical linear algebra. in Theoretical Computer Science, 10(1–2):1–157, 2014.
[35] Y. Yang, M. Pilanci, M. J. Wainwright, et al. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.
(d) . In particular,
In particular, properties (a) and (b) implies that a random variable is sub-Gaussian (or sub-exponential) if and only if its tail probability is bounded by a Gaussian (or exponential) random variable. Properties (c) and (d) tell us if X and Y are both sub-Gaussian, then are all sub-exponential. Property (e) tells us for -th moments of X exist whenever is finite. Property (f) tells us we can always center random variables without changing their -norm up to a constant factor. Property (g) tells us all sub-Gaussian random variables are also sub-exponential random variables.
Since
(c) This follows from definition.
Applying Young’s inequality again we have
This shows
(e) Without loss of generality, we can assume = 1. Then by property (a),
With a change of variable
where Γ() denotes the Gamma function. Note that for s > 0,
where we used the fact that attains maximum at x = 2s because
Therefore
Using property (d) and the fact that
This completes the proof with
(g) Without loss of generality, we can assume = 1. Then by property (a),
Therefore by property (b) we have
Here we give an example to demonstrate the dependence for sample complexity in the JL Lemma (Lemma 5.1) is optimal for small . This example is virtually the same as the one in Proposition 4.5, however, this result is not implied by Proposition 4.5 as the latter does not guarantee such dependence on K is optimal when
Proposition B.1. Suppose random matrix has symmetric i.i.d. entries that and denote K the positive number such that . Also denote . If for any fixed , the probability bound
(a) is an isotropic and sub-Gaussian matrix with sub-Gaussian parameter being no more than K.
Proof. Recall that , so part (a) is straightforward to verify. Now we proceed with proof for part (b). Notice that
Since
we get
Therefore
Here we list and prove the non-standard inequalities used in our proofs.
Taking exponential we get