b

DiscoverSearch
About
My stuff
Sub-Gaussian Matrices on Sets: Optimal Tail Dependence and Applications
2020·arXiv
Abstract
Abstract

Random linear mappings are widely used in modern signal processing, compressed sensing and machine learning. These mappings may be used to embed the data into a significantly lower dimension while at the same time preserving useful information. This is done by approximately preserving the distances between data points, which are assumed to belong to  Rn. Thus, the performance of these mappings is usually captured by how close they are to an isometry on the data. Random Gaussian linear mappings have been the object of much study, while the sub-Gaussian settings is not yet fully understood. In the latter case, the performance depends on the sub-Gaussian norm of the rows. In many applications, e.g., compressed sensing, this norm may be large, or even growing with dimension, and thus it is important to characterize this dependence.

We study when a sub-Gaussian matrix can become a near isometry on a set, show that previous best known dependence on the sub-Gaussian norm was sub-optimal, and present the optimal dependence. Our result not only answers a remaining question posed by Liaw, Mehrabian, Plan and Vershynin in 2017, but also generalizes their work. We also develop a new Bernstein type inequality for sub-exponential random variables, and a new Hanson-Wright inequality for quadratic forms of sub-Gaussian random variables, in both cases improving the bounds in the sub-Gaussian regime under moment constraints. Finally, we illustrate popular applications such as Johnson-Lindenstrauss embeddings, randomized sketches and blind demodulation, whose theoretical guarantees can be improved by our results in the sub-Gaussian case.

Keywords: compressed sensing, dimension reduction, random matrices, sub-Gaussian, Bernstein’s inequality, Hanson-Wright inequality, blind demodulation

Random linear mappings play a central role in dimension reduction, compressed sensing, and numerical linear algebra due to their propensity to preserve the geometry of a given set. The performance of a random linear mapping  A ∈ Rm×n is often determined by the uniform concentration bound of

√m∥Ax∥2 around ∥x∥2for all vectors in a set of interest (in other words, how close the map 1√mAis to being an isometry on the set). This is now well-understood by the standard techniques in the Gaussian random matrix case [9,30,32]. However, in many applications, non-gaussian random mappings are more useful because of their computational/storage benefits or simply the difficulty to generate Gaussian matrices using sampling devices [17]. For example, sparse or structured random matrices are preferred in both dimension reduction [8] and random sketching in numerical linear algebra [1,14,25,34] since they provide more efficient matrix multiplications than dense and unstructured matrices such as Gaussian ones. Certain formulations in compressed sensing also naturally require random matrices such as randomly subsampled Fourier measurements [18] or Bernoulli random matrices [28].

There has been a series of recent works [8, 20, 24] to demonstrate the effectiveness of random mappings outside the Gaussian setup. Unlike the Gaussian case in which we have a rotation invariance property, non-Gaussian setups require more sophisticated arguments to address various new technical challenges. In this article, we will be focusing on sub-Gaussian random mappings.

Let us recall some definitions. For  α ≥ 1, the ψα-norm (which is the Orlicz norm taken with respect to function exp(xα) −1) of a random variable X is defined as

image

In particular,  α= 2 gives the sub-Gaussian norm and  α= 1 gives the sub-exponential norm. The random variable X is called sub-Gaussian if  ∥X∥ψ2 < ∞and called sub-exponential if  ∥X∥ψ1 < ∞.

For sub-Gaussian random variables, the  ψ2-norm roughly measures how fast the tail distribution decays – usually the bigger  ψ2-norm is, the heavier the tail. We will repeatedly use the fact that ∥X∥ψ2 ≤ Kif and only if the tail probability  P(|X| ≥ t) is bounded by a Gaussian with standard deviation in the order of K. A precise statement of this, along with some other properties of ψα-norm, can be found in Appendix A.

The sub-Gaussian norms for many random variables can be calculated by looking at the moment generating function of their squares. For example, the sub-Gaussian norm for  Normal(0, σ2) is�83 σ; for Bernoulli(p) it is log− 12 �1 + p−1�; for Rademacher random variable it is log− 12(2) and

for any bounded (by M) random variable it is no more than  M log− 12 (2). For Exponential(λ), itis not a sub-Gaussian random variable, but has sub-exponential norm 2λ.For a random vector  a ∈ Rn we say ais sub-Gaussian if

image

We say a random matrix  A ∈ Rm×n is isotropic and sub-Gaussian if its rows are independent, isotropic and sub-Gaussian random vectors in  Rn. The sub-Gaussian parameter of A is defined as

image

For random matrix  A ∈ Rm×n, the isotropic condition guarantees 1√mAwill preserve Euclidean norm in expectation. Some examples of isotropic and sub-Gaussian matrices are matrices whose entries  Aijare independent and sub-Gaussian with  EA2ij = 1, uniformly subsampled (with replace- ment and after proper normalization) rows of orthonormal basis or tight frames, etc. [32]. In the cases of Bernoulli matrices or sparse ternary matrices, which is a generalization of the database-friendly mappings in [1], the sub-Gaussian parameter can depend on the signal dimension n if the probability of an entry being nonzero is n-dependent.

In the line of research regarding sub-Gaussian random mappings, Liaw et al. [20] showed that for isotropic and sub-Gaussian mapping A with sub-Gaussian parameter  K, let T ⊂ Rn, then wehave with high probability,

image

Here w(T) is the Gaussian width given by

image

which is the radius when T is symmetric.

Gaussian width measures the complexity of a set. In particular, denote cone(T) := {tx : t ≥0, x ∈ T}, then w2(cone(T) ∩ Sn−1) is a meaningful approximation for dimension [6,24]. Generally rad(T) is also dominated by w(T). For example, if 0  ∈ T, then by Jensen’s inequality,

image

In such case, (1) implies that with high probability, 1√mAis a near isometry on T whenever m ≥ CK4w2(T) for some constant C.

The dependency on w(T) in (1) is optimal. This is easy to see when m = 1 and A has i.i.d. Normal(0, 1) entries. But when it comes to the dependency on the sub-Gaussian parameter K, whether the  K2 factor can be improved is a question raised but left unanswered in [20]. Other important works regarding this type of bounds are either not explicit [15,24] or at least of the same order  K2 [7,8,23].

In this article, we refine this dependency on the sub-Gaussian parameter from  K2 to the optimal K√log K. This enhances the concentration bound substantially when the sub-Gaussian mapping is not well-behaved, for example, when K increases together with the signal dimension. We also relax the row-independent requirement by considering random mappings in the form of BA where B is an arbitrary matrix and A is mean zero, isotropic and sub-Gaussian. The mean zero assumption is additional when comparing to the assumptions for (1), and not needed when B is only diagonal. However, it is necessary for arbitrary B. Our bound is broadly applicable since it only require these properties from the random matrix A without any other assumptions.

Now we state our main theorem. In the following,  ∥B∥F and ∥B∥denote Frobenius and operator norm of B respectively. The matrix  B ∈ Rl×m is diagonal means that the only possible non-zero entries are  Bii where 1 ≤ i ≤ min{l, m}.

Theorem 1.1. Let B ∈ Rl×m be a fixed matrix, let  A ∈ Rm×n be a mean zero, isotropic and sub-Gaussian matrix with sub-Gaussian parameter  K and let T ⊂ Rn be a bounded set. Then

image

image

Here C is an absolute constant. Furthermore, when B is a diagonal matrix, random matrix A only need to be isotropic and sub-Gaussian with sub-Gaussian parameter K for the conclusions to hold.

When B is the identity matrix, we have the following corollary.

Corollary 1.2. Let A ∈ Rm×n be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter  K, and let T ⊂ Rn be a bounded set. Then

image

The bound appearing on the right hand side of Theorem 1.1 is optimal in general. The  ∥B∥factor is optimal and this is easy to see when B has non-zero singular values being all equal (because the statement should be invariant under scaling for B). We also give another example below in which the singular values are not all equal. The dependency on K is optimal and this follows from Proposition 4.5 in Section 4.4 with T being a singleton. As mentioned before, rad(T) is generally dominated by w(T) and the dependency on w(T) is also optimal.

Assuming rad(T) is dominated by w(T), Theorem 1.1 then implies that with high probability, matrix 1∥B∥F BAis a near isometry on T whenever the stable rank of B

image

This result recovers (1) with improved dependency on  K when B = Im.

image

Bij = 1 for 1 ≤ i ≤ l and 1 ≤ j ≤ m, then ∥B∥F = ∥B∥ =√lm. Suppose Ahas i.i.d. entries where

image

It is easy to verify that  EA ̸= 0 and Ahas isotropic rows  ATi . Moreover, for any  y = (y1, . . . , yn) ∈Sn−1, notice that

image

On the other hand,  ∥B∥ [w(T) + rad(T)] =√lm. So in this case, Theorem 1.1 does not hold when m is sufficiently large.

As an example demonstrating  ∥B∥is optimal in general, consider the case when  T = {x} ⊂ Sn−1,A is standard gaussian so that  g := Ax ∼ Normal(0, Im) and B = diag(τ, 1, . . . , 1) where τ > 0.Also let  gibe the coordinates of g, then

image

where we used Jensen’s inequality in the second last line. This estimate is in the order of  τ = ∥B∥when  τ > C√mwith some constant C large enough.

image

isotropic and sub-Gaussian conditions of A guarantee that K is bounded below from 1. To see this, let  X := Ax for some x ∈ Sn−1, then Xhas independent coordinates  Xisatisfying  EX2i = 1 and

image

we can conclude that  K ≥ K0and the equality is achieved when  Xi = 1 a.s.

The proof for Theorem 1.1 follows an analogous approach in Liaw et al. [20]. One major difference is that we prove and apply two new concentration inequalities with improved parametric dependency in the sub-Gaussian regime. We believe these inequalities are interesting on their own as an application-oriented concentration inequality.

The first one is a new Bernstein type inequality under bounded first absolute moment condition. This inequality provides a concentration bound for sum of sub-exponential random variables.

Theorem 1.3 (New Bernstein’s Inequality). Let a = (a1, . . . , am)be a fixed non-zero vector and let  Y1, . . . , Ymbe independent, mean zero sub-exponential random variables satisfying  E|Yi| ≤ 2 and∥Yi∥ψ1 ≤ K2i �assume Ki ≥ 65�. Then for every  t ≥ 0 we have

image

Remark 1.4. Theorem 1.3 remains true (with a different absolute constant c) when the 2 in E|Yi| ≤ 2is replaced with an arbitrary positive constant (see Remark 2.2 for more detail).

The second one is a new Hanson-Wright inequality under unit variance condition. This inequality provides a concentration bound for quadratic forms of independent random variables and is more general than the aforementioned Bernstein’s inequality. In the literature, results of similar flavor have been obtained [3,16,33] but under different assumptions. We will give a brief comparison between our result and a few notable ones in Section 3.

Theorem 1.5 (New Hanson-Wright Inequality). Let A ∈ Rn×n be a fixed non-zero matrix and let X = (X1, . . . , Xn) ∈ Rn be a random vector with independent, mean zero, sub-Gaussian coordinates satisfying  EX2i = 1 and ∥Xi∥ψ2 ≤ K. Then for every  t ≥ 0 we have

image

Remark 1.6. If A is a diagonal matrix, then Theorem 1.5 recovers Theorem 1.3 (assuming all Kiare equal) with  Yi = X2i − EX2i . Therefore this can be viewed as a generalization of the new Bernstein’s inequality given in Theorem 1.3.

image

We use  ∥ · ∥2for Euclidean norm of vectors,  ∥ · ∥F and ∥ · ∥for Frobenius and operator norm of matrices respectively. We use  ◦for Hadamard (entrywise) product. We say  f ≲ g if f ≤ Cg forsome absolute constant  C and say f ≳ g if f ≥ Cgfor some absolute constant C. Typically, c and C denote absolute constants (often c for small ones and C for large ones) which may vary from line to line.

image

The rest of this paper is organized as follows: In Section 2, we discuss and prove the new Bernstein’s inequality (Theorem 1.3). In Section 3, we first discuss and compare the new Hanson-Wright inequality (Theorem 1.5) to other known variants of Hanson-Wright inequalities and then prove Theorem 1.5. In Section 4, we prove our main theorem regarding sub-Gaussian matrices on sets (Theorem 1.1) and show that our dependency on K is optimal. In Section 5, we demonstrate how our result can improve theoretical guarantees of some popular applications such as JohnsonLindenstrauss embedding, randomized sketches and blind demodulation. In Section 6, we make a brief conclusion for this paper.

In this section we prove the new Bernstein’s inequality Theorem 1.3. Let us first recall the standard Bernstein’s inequality for sub-exponential random variables [32, Theorem 2.8.2], which states that for independent, mean zero, sub-exponential random variables  Y1, Y2, . . . , Ymand a vector a = (a1, . . . , am) ∈ Rm, we have

image

where  K2 = maxi ∥Yi∥ψ1.

Compared to (2), Theorem 1.3 has an extra assumption on the first absolute moment of  Yi –namely  E|Yi| ≤2, but it improves the dependence on K in the sub-Gaussian regime from  K4 toK2 log K. It is worth noting that such extra assumption comes naturally when considering isotropic random matrices/vectors. In fact, let x be a fixed point on the unit sphere and let  aibe isotropic random vectors of the same dimension, then  Yi := | ⟨ai, x⟩ |2 −1 is mean zero since  aiis isotropic, and  E|Yi| ≤ E| ⟨ai, x⟩ |2 + 1 = 2 by triangle inequality.

image

We will first bound the moments of  Yi, then bound their moment generating functions, and finally use Chernoff method to obtain the desired tail bound.

image

The idea here is to write the moment as an integral and then estimate under the two constraints E|Yi| ≤ 2 and ∥Yi∥ψ1 ≤ K2.

Lemma 2.1 (Moment Bounds). Let Y be a sub-exponential random variable satisfying  E|Y | ≤ 2and  ∥Y ∥ψ1 ≤ K2 with K ≥ 65. Then

image

Proof. Define f(t) := P(|Y | ≥ t) et/K2. Since E|Y | ≤2, we have

image

Also, since  ∥Y ∥ψ1 ≤ K2, a change of variable  s = et/K2 gives

image

For the p-th moment of |Y |, with a change of variable  s = up, we have

image

We will split this integral into two parts. Set  T = 6pK2 log K. Since pup−1 monotonically increases on [0, T], we have

image

On the other hand, since

image

Combining these two parts completes the proof with  C ≤ 6.

image

Let Y be the random variable as in Lemma 2.1, the moment generating function of Y can be estimated through Taylor series

image

Here the first inequality is by Lemma 2.1 (with  C1 ≤6) and the second inequality uses  p! ≥ (p/e)p.When  |λ|K2 log K ≤ 1/(2C1e), the above summation converges and we have

image

for absolute constants  C0 = (C1e)2 and c = 1/(2C1e).

image

where  c0 and C0are absolute constants. When we minimize the above expression over  λ, we getthe optimal value

image

Next we plug in  λoptinto (6) to get

image

The bound for  P (� aiYi < −u) is similarly obtained by considering  −Yiinstead of  Yi. This com- pletes the proof.

Remark 2.2. If the random variables  Yihave first absolute moment  E|Yi| ≤ α, then the right hand side of Equation (3) becomes  αand it is easy to see that Lemma 2.1 still holds with  C ≤ 6 + α. Itfollows that the  C1in Step 2 will be no more than 6 +  αand Theorem 1.3 now holds with constant c = 14(C1e)2 ≥ 14e2(6+α)2 .

Hanson-Wright inequality gives a concentration bound for quadratic forms of random variables. The version in [27] states that for a random vector  X = (X1, . . . , Xn) ∈ Rn with independent, mean zero, sub-Gaussian coordinates, suppose maxi ∥Xi∥ψ2 ≤ K and let A be an n×nreal matrix, then

image

In the same spirit as the new Bernstein’s inequality, we can improve the tail dependency on K in the sub-Gaussian regime from  K4 to K2 log Kunder a further assumption  EX2i = 1 for each  Xi.This is the new Hanson-Wright inequality Theorem 1.5.

It is not difficult to drop the requirement  EX2i = 1 in Theorem 1.5 by a simple scaling, in which case we have the following corollary.

Corollary 3.1. Let X = (X1, . . . , Xn) ∈ Rn be a random vector with independent, mean zero, sub-Gaussian coordinates satisfying 0  < ∥Xi∥ψ2 ≤ K, then for fixed square matrix A,

image

where α1 = mini�EX2i� 12 , α2 = maxi�EX2i� 12 and γ = α2/α1.

Proof. Let βi := (EX2i )12 for 1 ≤ i ≤ nand define diagonal matrices

image

Then ˜X := D1/βXsatisfies the assumption of Theorem 1.5 with  E ˜X2i = 1 and ∥ ˜Xi∥ψ2 ≤ K/βi ≤K/α1. Applying Theorem 1.5 to ˜X and ˜A := DβADβcompletes the proof.

image

Let us first compare Corollary 3.1 to the standard Hanson-Wright inequality (8) in the case when γ= 1. The concentration bound in (8) implies that, with probability at least 1  − 2e−t,

image

Meanwhile, Corollary 3.1 implies that

image

where  α = (EXi)12. Note that  α ≤ ∥Xi∥ψ2 ≤ K, so this bound improves the parameter dependence (up to a log factor) in the sub-Gaussian regime from  K2 to αK. Such improvement can be significant when  αis far less than K.

Other variants of Hanson-Wright inequality have appeared in literature with similar improvements [3,33]. In particular, one of the results by Adamczak [3] works under the assumption that X satisfies the convex concentration property with constant ˜K, that is, for every 1-Lipschitz convex function  ϕ : Rn → R, we always have  E|ϕ(X)| < ∞ and

image

Then under such assumption,

image

where Cov(X) is the covariance matrix of X. When X has independent and mean zero coordinates, ∥Cov(X)∥ = maxi EX2i .However, the convex concentration property is not the same as subGaussianity. More precisely, while it is true that ˜K is independent of dimension when X has i.i.d. coordinates which are bounded almost surely [29], this can fail when the boundedness assumption of  Xiis replaced by sub-Gaussianity (i.e. ˜K could depend on the dimension of  X when Xiare i.i.d. and sub-Gaussian) [2,13]. Therefore the bound in (11) does not imply (10) nor (9) in general.

In a more recent paper by Klochkov and Zhivotovskiy [16], the authors proved a uniform version of the Hanson-Wright inequality which, when applying to a single matrix under the same assumption as Corollary 3.1, yields the following bound:

image

where  M = ∥ maxi |Xi|∥ψ2. This bound also improves (9) in some cases as demonstrated in [16]. We shall compare this bound to (10) in the sub-Gaussian regime. On one hand, Jensen’s inequality tells us that the  E∥AX∥2factor in (12) is bounded by the  α∥A∥Ffactor in (10). On the other, the factor M in (12) is only bounded by  M ≲ K√log n, which could depend on dimension n. Moreover, (12) only provides a one-sided bound instead of two-sided concentration bounds like Equations (9) to (11).

image

The main idea of proof is similar to [27], that is to divide the sum into diagonal and off-diagonal, then bound the moment generating function of the latter through a decoupling and comparison argument. However, there are two significant differences. The first difference is the random variables used for comparison. We will use scaled Bernoulli multiplied by standard Gaussian in order to preserve the condition of second moment being 1. Using such random variables also leads to challenges in bounding the moment generating function, which is the second difference.

Now we proceed with the proof. For any t > 0, let

image

be the the tail probability we want to bound. Let  A1 := diag(A) be the diagonal of A and let

image

We will seek bounds for  p1 and p2.

image

The bound for  p1is given by our new Bernstein’s inequality. Notice that

image

where  E|X2i − 1| ≤ 2 and ∥X2i − 1∥ψ1 ≤ C∥X2i ∥ψ1 ≤ CK2. So by Theorem 1.3 and the simple relationships between the norms of  A1 and A, we have

image

To bound  p2, we will derive a bound for the moment generating function of  XT A2X. Let X′

be an independent copy of X, then

image

The above follows directly from the following decoupling lemma.

Lemma 3.2 (Decoupling [32]). Let A = (aij) be a fixed n × nmatrix, and let  X = (X1, . . . , Xn) ∈Rn be a random vector with independent mean zero coordinates. Then for every convex function

image

See Theorem 6.1.1 and Remark 6.1.3 in [32] for a proof of Lemma 3.2.

image

We will compare  X (and X′) to scaled Bernoulli multiplied entrywise by standard Gaussian. But first let us look at the case of a single variable through the following lemma.

image

Proof. Using the inequality  ex ≤ x + cosh(2x), which is true for all  x ∈ R(see Appendix C), we have

image

By Lemma 2.1 we know that  EZ2q ≤ Cq0qqL2q−2for any positive integer q and some absolute constant  C0, hence

image

On the other hand, a direct calculation gives

image

Choosing any  C such that C2 ≥ 8C0completes the proof.

Now let  g, r ∈ Rn be random vectors such that  g ∼ Normal(0, In) and rhas entries  r2i i.i.d∼L2 · Bernoulli(L−2) where L2 = K2 log K. Also let g′ and r′ be independent copies of g and r. Let  αbe any vector in  Rn, by Lemma 3.3 and independence we have

image

Note the above also holds for  EX′ exp(αT X′). Therefore

image

where  R := diag(r) and R′ := diag(r′). Here the two inequalities are repeated applications of Equation (13).

image

Denote  σi = σi(RAR′) the singular values of matrix  RAR′. From the rotation invariance of g and  g′ we have

image

For standard normal random variables  gi and g′i,

image

where the inequality uses (1  − x)− 12 ≤ ex when x ∈ [0, 12) (see Appendix C).

Also note that  σi ≤ ∥RAR′∥ ≤ L2∥A∥, so if λ2 < 12L4∥A∥2 we have

image

Next, use the following Lemma 3.4 (with  η = λ2L4 and p = L−2) to bound the moment generating function of  ∥RAR′∥2F and we obtain

image

Lemma 3.4. Let D be a diagonal random matrix with i.i.d. entries  Dii = di ∼ Bernoulli(p), andlet  D′ be an independent copy of D. Given a fixed matrix A, then

image

Proof. Denote Ai the i-th row of A. Notice that

image

so for  η ∈�0, 1∥A∥2�, we have

image

Here the second last inequality uses  η∥Ai∥22 ≤ η∥A∥2 ≤ 1 and ex ≤ 1 + 2x when x ∈ [0,1]. The last inequality uses 1 +  x ≤ ex.

image

From previous steps we get

image

Optimizing this over  λ(similar to proof of Theorem 1.3) yields a one sided bound for  p2. The other side can then be obtained by considering  −A2 (and −A) instead of  A2. Together they give

image

Lastly, since  p ≤ min{1, p1 + p2}, combining the bounds for  p1, p2and then applying inequality min{1, 4e−x} ≤ 2e−x/2 (see Appendix C) complete the proof of Theorem 1.5.

In this section we prove Theorem 1.1 and show that the  K√log Kdependence on K is optimal. Section 4.1 studies the simple case when T consists of only a single point. Section 4.2 establishes the technical sub-Gaussian increments lemmas and Section 4.3 proves Theorem 1.1 through these lemmas and Talagrand’s Majorizing Measure Theorem. Section 4.4 provides an example through scaled Bernoulli random variables that shows  K√log K is tight.

4.1 Concentration of Random Vectors

Let  X := Ax ∈ Rm with x ∈ Sn−1. The isotropic and sub-Gaussian assumption on A now implies X has independent coordinates satisfying  EX2i = 1 and ∥Xi∥ψ2 ≤ K.Lemma 5.3 in [20] states that

image

In other words,  ∥Ax∥2has a sub-Gaussian concentration around √m. It is worth noting that this concentration is independent of the ambient dimension m. We will follow a similar proof idea, but use the new inequalities (Theorem 1.3 and Theorem 1.5) to generalize and refine this result.

Theorem 4.1. Let B be a fixed m × nmatrix and let  X = (X1, . . . , Xn) ∈ Rn be a random vector with independent sub-Gaussian coordinates satisfying  EX2i = 1 and ∥Xi∥ψ2 ≤ K. If either one of the following conditions further holds:

image

image

Proof. The conclusion is trivial if  ∥B∥= 0, so we will assume B is non-zero.

image

Note that for  α, β, s ≥ 0,

image

(This readily comes from the inequalities  |α2 − β2| ≥ |α − β|2 and |α2 − β2| ≥ |α − β|β whenever

image

Let  Z := ∥BX∥2 − ∥B∥F , then

image

To bound this probability, we observe that

image

Combining these two bounds and then using property (b) in Appendix A complete the proof.

(b) We will first use Bernstein’s inequality to obtain (14). Denote  bi := Biithe diagonal entries of

image

For random variables  X2i −1, notice that

image

where the  ψ1-norm estimate is from property (f) in Appendix A. So by Theorem 1.3 and using the inequality � b4i ≤�maxi b2i�· � b2i = ∥B∥2∥B∥2F , we have

image

The rest of the proof is the same as in (a).

4.2 Sub-Gaussian Increments Lemma

A key lemma for Theorem 1.1 is to show that the random process  Zx := ∥BAx∥2 − ∥B∥F ∥x∥2 hassub-Gaussian increments. That is,  ∥Zx − Zy∥ψ2 ≤ M∥x − y∥2 for some Mand for all  x, y ∈ Rn.Theorem 1.3 in [20] showed sub-Gaussian increments for  B = Im with M = CK2. Here we improve and generalize this result to any  B with M = CK√log K ∥B∥. The K√log Kfactor is in fact optimal as suggested by Proposition 4.5 in Section 4.4.

We will prove two versions of the sub-Gaussian increment lemma. The first one (Lemma 4.2) is for arbitrary B, but require the random matrix A to be mean zero. The second one (Lemma 4.3) is only for diagonal B, but does not require zero mean from A.

For Lemma 4.2 below, the beginning of the proof follows the argument in [20], except we will use Theorem 4.1 for better tail bounds. Later on in the proof, we will use a different approach to bound one of the tail probabilities (i.e.  p3) through the new Hanson-Wright inequality (Theorem 1.5).

Lemma 4.2. Let B ∈ Rl×m be a fixed matrix and let  A ∈ Rm×n be a mean zero, isotropic and sub-Gaussian matrix with sub-Gaussian parameter K. Then the random process

image

Proof. The statement is invariant under scaling for B. So without loss of generality, we will assume B has operator norm  ∥B∥ = 1.

image

Without loss of generality, assume  x ̸= y and define

image

We need to bound this tail probability by a Gaussian whose standard deviation is the order of K√log K. Consider the following two cases:

image

0  < s < 2∥B∥F . Write p as

image

Next we derive bounds for  p1, p2 and p3.

image

From  s ≥ 2∥B∥F we have

image

Applying Theorem 4.1 to the random vector Au we get

image

Applying Theorem 4.1 to the random vector Ax and note that  ∥B∥F > 12s, we get

image

Notice that

image

Let us also denote  Xw := Aw for w ∈ Rn, then from  EXwXTw = ∥w∥22 In we have

image

Thus we can further write Z as

image

where the second equality uses the fact that Z is mean zero and in the last equality  Yw := ∥BXw∥22−

image

so by Theorem 1.5 we have

image

Hence for 0  ≤ t ≤ ∥w∥22∥B∥2F ,

image

Now we apply Equation (15) to  p4, p5 and p6.

For  p4. Since s < 2∥B∥F and ∥u + v∥2 =�1 + ∥v∥22 ∈ [1,√5), we can conclude that

image

For  p5. Notice that  ∥u∥2 = 1 and 1 − 18∥v∥22 ∈ (12, 1], so

image

image

So far we have showed that

image

where  pi ≤ 2 exp( −cs2K2 log K ) for some absolute constant  c and 1 ≤ i ≤ 6. Note p ≤ 1 and theinequality min{1, 8e−x} ≤ 2e−x/3 (see Appendix C), we get

image

Without loss of generality, we can assume  ∥x∥1 = 1 and ∥y∥2 ≥ 1. Let ¯y := y∥y∥2 be the projection of y onto unit ball, then by triangle inequality,

image

Here  R1it is bounded by  CK√log K∥x − ¯y∥2 since x, ¯y ∈ Sn−1, and

image

where the first equality uses  Zy = ∥y∥2Z¯y, the second equality is true since  ∥y∥2 − 1 = ∥y − ¯y∥2and the last inequality follows from Theorem 4.1. Combining these bounds we get

image

Finally, note that  ∥x∥2= 1, so by non-expansiveness of projection,  ∥x − ¯y∥2 ≤ ∥x − y∥2, and bydefinition of projection,  ∥y − ¯y∥2 ≤ ∥y − x∥2. This completes the proof.

Next we show the second version of sub-Gaussian increment lemma, which requires B to be diagonal and does not need A to be mean zero. The proof is mostly the same as Lemma 4.2, so we will only highlight the differences.

Lemma 4.3. Let B ∈ Rl×m be a fixed diagonal matrix and let  A ∈ Rm×n be a isotropic, subGaussian matrix with sub-Gaussian parameter K, then the random process

image

Proof. If B is not a square matrix, we can always add  m − lrows of zeros to B (when l < m) or remove the last  l − mrows of zeros from B (when l > m). This will turn  B into a m × m squarematrix without changing the values of  ∥BAx∥2, ∥B∥F and ∥B∥. So without loss of generality, we can assume B is a square matrix. Also without loss of generality, we can further assume  ∥B∥ = 1since the conclusion is invariant under scaling for B.

The remaining proof for Lemma 4.3 is the same as proof for Lemma 4.2 except for bounding  p3 inStep 1. A bound for  p3here can be obtained through the new Bernstein’s inequality (Theorem 1.3) as detailed below.

Recall that

image

where  bi := Bii and Ai is the i-th row of A.The random variables  Yi := ⟨Ai, u⟩ ⟨Ai, v⟩ areindependent, with

image

Here we used  ∥x∥2 = ∥y∥2 = ∥u∥2 = 1, ∥v∥2 ≤2, and that  Aiis isotropic. Furthermore, from property (d) in Appendix A we have

image

Since 0  < s < 2∥B∥F , we get

image

4.3 Proof of Theorem 1.1

Theorem 1.1 follows form the sub-Gaussian increments lemmas and Talagrand’s Majorizing Measure Theorem. Let us first recall the Majorizing Measure Theorem. The following statement is from [20].

Theorem 4.4 (Majorizing Measure Theorem). Let (Zx)x∈Tbe a random process on a bounded set T ⊂ Rn. Assume that the process has sub-Gaussian increments, that is there exists  M ≥ 0 suchthat

image

The first part of Theorem 4.4 can be found in [31, Theorem 2.4.12] and the second part can be found in [7, Theorem 3.2].

image

Using Lemma 4.2 and Theorem 4.4 (Majorizing Measure Theorem), we get

image

Using property (e) in Appendix A and Lemma 4.2, we get

image

Since diam(T) ≤ 2 rad(T), applying Lemma 4.2 and Theorem 4.4 we know that the event

image

holds with probability at least 1  − 2e−u2.Combining these yields the desired high probability bound.

Finally, when B is a diagonal matrix and A is not necessarily mean zero, we can repeat the above argument with Lemma 4.3 instead of Lemma 4.2. This completes the proof.

4.4 An Example for Lower Bound

Here we use scaled Bernoulli random variables to demonstrate that the  K√log Kfactor in Theo- rem 1.1 is optimal in general.

Proposition 4.5. Let K ≥ 3 and X = (X1, . . . , Xm) ∈ Rm be a random vector with independent coordinates such that 1K2 log K X2i ∼ Bernoulli� 1K2 log K�, then ∥Xi∥ψ2 ≤ K, and for m ≥ K2 log K,

image

Note that the expected number of non-zero coordinates for  X is mK2 log K , so m ≥ K2 log Kessentially says X is non-zero in expectation, which is a mild assumption. For the proof of Proposition 4.5, we will need the following lower bound (see [26, Lemma 4.7.2]) about Binomial distributions.

image

Here  D(x∥y) is the Kullback-Leibler divergence between two Bernoulli distributions with parameters x and y respectively given by

image

Moreover, for 0 < y < x < 1,

image

Proof of Proposition 4.5. ∥Xi∥ψ2 ≤ Kfollows directly from definition since

image

Let  λ > 0, Z := ∥X∥2 − √m and L2 := K2 log K, with a change of variable  s = λt/L2 we have

image

To show (16), we need to find a  λ such that E exp(λZ2/L2) > 2.So by a change of variable t = v2L2, it suffices to show

image

Let

image

then

image

where 1L2∥X∥22 ∼ Binomial�m, 1L2�. So by (17) we have

image

with  λ0:= 9 log 9. Here the second inequality uses 2v/βv ≥1 on the interval of integration, and the last inequality holds because

image

Take  λ = λ0 we get

image

This proves (16) with  c = 1/√λ0 ≈ 0.22.

5.1 Johnson-Lindenstrauss Lemma

One immediate application of our result is a guarantee for all isotropic and sub-Gaussian matrices as Johnson-Lindenstrauss (JL) embeddings for dimension reduction. We state this JL lemma below. It follows directly form Theorem 4.1.

Lemma 5.1. Let A ∈ Rm×n be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter

image

Proof. By scaling we can assume  ∥x − y∥2= 1. By Theorem 4.1 (with  B = Im) we have

image

the result then follows from property (a) in Appendix A.

image

on the example in Proposition 4.5, we can further show that (see Appendix B) the dependence on sub-Gaussian parameter K here is also optimal for small  ε, δ. Similar results to Lemma 5.1 have appeared in [8, 22], but to the best of our knowledge, the previous known dependence on K was K4.

5.2 Randomized Sketches

Randomized sketching provide a method for approximating convex programs [25, 35]. In essence, a randomized sketch reduces the dimension of the original optimization problem through random projections, which can be beneficial in both computational time and memory storage. Following the problem formulation and ideas in [25], consider convex program in the form of

image

where  B ∈ Rn×d, y ∈ Rd and C ⊂ Rd is some convex set. Let  A ∈ Rm×n be an isotropic and sub-Gaussian matrix and solve instead the convex program

image

This is called the ”sketched problem”. It reduce the dimension from n to m and can be viewed as an approximation to the original problem (20). Moreover, say a solution ˆx to the sketched problem (21) is  δ-optimal to the original optimal solution  x∗ of (20) if

image

Pilanci and Wainwright [25] gave a high probability guarantee for ˆx being δ-optimal when m is sufficient large. The following Theorem 5.2 improves the dependence on K in their guarantee from  K4 to K2 log K. The proof of Theorem 5.2 is also more concise thanks to the tools we have developed.

Theorem 5.2 (δ-optimal guarantee). Let A be an isotropic and sub-Gaussian matrix with subGaussian parameter  K. For any δ ∈ (0, 1), if

image

then a solution ˆx to the sketched problem as given in (21) is  δ-optimal with probability at least 1  − c1e−c2mδ2/(K2 log K). Here c0, c1, c2are absolute constants and T is the tangent cone of C at optimum  x∗, given by

image

We will use an argument similar to [25] to prove Theorem 5.2. First let us state a deterministic result that says  δ-optimality can be obtained by controlling two quantities.

image

Next we show a technical Lemma that will be helpful when estimating  Z1 and Z2.

Lemma 5.4. Let A be an isotropic and sub-Gaussian matrix with sub-Gaussian parameter K, and let  T ⊂ Rn be a set with radius rad(T) ≤ 2, then there exists absolute constants C and c such that for any  δ ∈ (0, 1),

image

Proof. Denote L := K√log K. By Corollary 1.2 we have

image

holds with probability at least 1  − 3e−mδ20/(9C20L2). On this event,

image

where we use the estimate��� 1√m∥Ax∥2 + ∥x∥2��� ≤ 2∥x∥2 + δ0 for x ∈ T.

Proof of Theorem 5.2. We wish to control the ratio  Z2/Z1in sight of Lemma 5.3. By Lemma 5.4, if  m ≥ CK2 log Kw2(T)/δ2, then

image

Let  T := BT ∩ Sn−1 and Q := 1mAT A − I. Since

image

triangle inequality gives

image

where  u + T := {u + v : v ∈ T}. Applying Lemma 5.4 to  Z(i)2 (i = 1, 2, 3) we get

image

Combining the bounds for  Z1 and Z2 we have

image

with probability at least 1  − 12e−cmδ2/(K2 log K). This completes the proof.

5.3 Favorable Landscape for Blind Demodulation with Generative Priors

In this section, we give a concrete example where the improvement on the sub-Gaussian parameter K can be important through blind demodulation with generative priors.

Blind demodulation aims to recover two signals  x0, y0 ∈ Rl from observation  z0 = x0 ◦y0, wheredenotes componentwise multiplication. Due to the inherent nature of ambiguity of the solutions from  z0, one usually assume that the signals come with some structure. A traditional way to model this structure is through a sparsity prior with respect to a basis such as wavelet or the Discrete Cosine Transform basis in case the signals are images.

On the other hand, with recent development in deep learning, the generative adversarial network (GAN) is turning out to be very effective in generating realistic synthetic images, which naturally indicates that we may model a certain type of image signals as outputs of GAN. Especially in the inverse problems like compressed sensing, phase retrieval including this blind demodulation, practitioners have observed an order of magnitude sample (observation) complexity improvement over the sparsity prior [5,12,21].

This alternative model is called the generative prior and as a consequence is becoming a new promising model for modern signal processing [5, 10–12]. In Hand and Joshi [10], the authors provide a global landscape guarantee for blind demodulation problem with generative priors and they applied our Bernstein’s inequality in their proof.

With generative priors, unknown signals  x0, y0are assumed to be in the range of two generative neural networks  G(1) and G(2) respectively. More precisely,  G(1) : Rn → Rl is a d-layer network, G(2) : Rp → Rl is a s-layer network and they can be written as

image

where relu is the Rectified Linear Unit activation function given by relu(x) = max{x, 0} and W (1)i , W (2)j for i ∈ {1, . . . , d} and j ∈ {1, . . . , s}are weight matrices.

The weight matrices are normally obtained in the training process of the networks but the empirical evidence in [4] suggests that they behave a “random-like” quantity . Based on this phenomenon, the authors of [10] made the following additional assumptions on the networks  G(1)and  G(2) to facilitate analysis further:

image

The signals can then be recovered by finding their latent codes  h0 ∈ Rn and m0 ∈ Rp such thatx0 = G(1)(h0) and y0 = G(2)(m0). This leads to the following empirical risk minimization program:

image

Note that there is a scaling ambiguity in this problem since it does not distinguish points on curve  {(ch, 1cm) : c > 0}for any given (h, m), thus one can only hope to find the solution curve

{(ch0, 1cm0) : c > 0}. The authors in [10] showed that under assumptions A1-A3, two conditions that are called the Weight Distributed Condition (WDC) and the joint-WDC are met. These conditions guarantee a favorable landscape for the objective function f(h, m), namely f has a descent direction at all points outside of a small neighborhood of four curves containing the solution. One of the important ingredients in their proof is concentration bounds for singular values of random matrices. When they showed that the joint-WDC condition is satisfied by concentration argument, they were able to improve the requirement in assumption A3 from, up to log factors, l ≳ n3 + p3 to l ≳ n2 + p2. Such improvement is made possible by our new Bernstein’s inequality with refined sub-exponential parameter dependence. This  n2 + p2 sample complexity matches the one in the previous recovery guarantees with sparsity prior (in which case n and p denotes the sparsity levels), but potentially better since the latent code dimension is oftentimes smaller than a sparsity level with respect to a particular basis. See Theorem 2, Theorem 5, Lemma 8, and Lemma 9 in [10] for more details.

In this article, we proved the optimal concentration bound for sub-Gaussian random matrices on sets. Namely, with high probability,

image

where  B ∈ Rl×m is an arbitrary matrix,  A ∈ Rm×n is an (mean zero) isotropic and sub-Gaussian random matrix,  T ∈ Rn is the set, K is the sub-Gaussian parameter of A, sr(B) is the stable rank of B, w(T) is the gaussian width of  T and rad(T) := supy∈T ∥y∥2. Compared to the previous work in [20], this result generalizes by allowing an arbitrary matrix B while improves the dependency on the sub-Gaussian parameter from  K2 to the optimal  K√log K. Consequently, this can lead to a tighter concentration bound even in the cases where the sub-Gaussian matrix BA have correlated rows. It is also worth noting that dependence on w(T) + rad(T) is optimal in general as well.

We also proved, under extra moment conditions, a new Bernstein type inequality and a new Hanson-Wright inequality. The extra conditions here are bounded first absolute moment (e.g. E|Yi| ≤2) for Bernstein’s inequality and bounded second moment (e.g.  EX2i = 1) for Hanson- Wright inequality. In many cases, these conditions can be easily met – for example, they are implied by the isotropic condition of random variables or vectors. In general, both of our new inequalities give improved tail bounds in the sub-Gaussian regime, which is the regime of interest in many applications as demonstrated in Section 5.

Y. Plan is partially supported by an NSERC Discovery Grant (22R23068), a PIMS CRG 33: HighDimensional Data Analysis, and a Tier II Canada Research Chair in Data Science. ¨O. Yılmaz is partially supported by an NSERC Discovery Grant (22R82411) and PIMS CRG 33: HighDimensional Data Analysis. H. Jeong is funded in part by the University of British Columbia Data Science Institute (UBC DSI) and by the Pacific Institute of Mathematical Sciences (PIMS). The authors of this paper would also like to thank Babhru Joshi for the helpful discussions and providing us with an important application in Section 5.

[1] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003.

[2] R. Adamczak. Logarithmic sobolev inequalities and concentration of measure for convex functions and polynomial chaoses. Bulletin of the Polish Academy of Sciences Mathematics, 53(2):221–238, 2005.

[3] R. Adamczak. A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.

[4] S. Arora, Y. Liang, and T. Ma. Why are deep nets reversible: A simple theory, with implica- tions for training. arXiv preprint arXiv:1511.05653, 2015.

[5] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 537–546. JMLR. org, 2017.

[6] E. J. Cand`es. Mathematics of sparsity (and a few other things). In Proceedings of the International Congress of Mathematicians, Seoul, South Korea, volume 123, 2014.

[7] S. Dirksen. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.

[8] S. Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Foundations of Computational Mathematics, 16(5):1367–1396, 2016.

[9] Y. Gordon. On milman’s inequality and random subspaces which escape through a mesh in Rn. In Geometric Aspects of Functional Analysis, pages 84–106. Springer, 1988.

[10] P. Hand and B. Joshi. Global guarantees for blind demodulation with generative priors. In Advances in Neural Information Processing Systems, pages 11531–11541, 2019.

[11] P. Hand, O. Leong, and V. Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9136–9146, 2018.

[12] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. In Proceedings of Machine Learning Research, volume 75, pages 1–8, 2018.

[13] P. Hitczenko, S. Kwapie´n, W. Li, G. Schechtman, T. Schlumprecht, and J. Zinn. Hypercontractivity and comparison of moments of iterated maxima and minima of independent random variables. Electronic Journal of Probability, 3, 1998.

[14] D. M. Kane and J. Nelson. Sparser johnson-lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014.

[15] B. Klartag and S. Mendelson. Empirical processes and random projections. Journal of Functional Analysis, 225(1):229, 2005.

[16] Y. Klochkov and N. Zhivotovskiy. Uniform hanson-wright type concentration inequalities for unbounded entries via the entropy method. arXiv preprint arXiv:1812.03548, 2018.

[17] F. Krahmer, D. Needell, and R. Ward. Compressive sensing with redundant dictionaries and structured measurements. SIAM Journal on Mathematical Analysis, 47(6):4606–4629, 2015.

[18] F. Krahmer and H. Rauhut. Structured random measurements in signal processing. GAMMMitteilungen, 37(2):217–238, 2014.

[19] K. G. Larsen and J. Nelson. The johnson-lindenstrauss lemma is optimal for linear dimensionality reduction. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 82:1–82:11, 2016.

[20] C. Liaw, A. Mehrabian, Y. Plan, and R. Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric Aspects of Functional Analysis, pages 277–299. Springer, 2017.

[21] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos. Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine, 35(1):20–36, 2018.

[22] J. Matouˇsek. On variants of the johnson–lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

[23] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4):1248–1282, 2007.

[24] S. Oymak and J. A. Tropp. Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA, 7(3):337–446, 2017.

[25] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guar- antees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

[26] B. Robert. Ash. Information Theory. Dover Publications Inc., 1990.

[27] M. Rudelson and R. Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013.

[28] R. Saab, R. Wang, and ¨O. Yılmaz. From compressed sensing to compressed bit-streams: practical encoders, tractable decoders. IEEE Transactions on Information Theory, 64(9):6098– 6114, 2018.

[29] P.-M. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing pro- cesses. The Annals of Probability, 28(1):416–461, 2000.

[30] G. Schechtman. Two observations regarding embedding subsets of euclidean spaces in normed spaces. Advances in Mathematics, 200(1):125–135, 2006.

[31] M. Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014.

[32] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[33] V. Vu and K. Wang. Random weighted projections, random quadratic forms and random eigenvectors. Random Structures & Algorithms, 47(4):792–821, 2015.

[34] D. P. Woodruff. Sketching as a tool for numerical linear algebra.  Foundations and Trends R⃝in Theoretical Computer Science, 10(1–2):1–157, 2014.

[35] Y. Yang, M. Pilanci, M. J. Wainwright, et al. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.

image

(d)  ∥XY ∥ψα ≤ ∥X∥ψpα∥Y ∥ψqα for p, q ∈ (1, ∞) such that 1p + 1q = 1. In particular,  ∥XY ∥ψ1 ≤∥X∥ψ2∥Y ∥ψ2;

image

In particular, properties (a) and (b) implies that a random variable is sub-Gaussian (or sub-exponential) if and only if its tail probability is bounded by a Gaussian (or exponential) random variable. Properties (c) and (d) tell us if X and Y are both sub-Gaussian, then  X2, Y 2 and XYare all sub-exponential. Property (e) tells us for  p ≥ 1, all p-th moments of X exist whenever ∥X∥ψαis finite. Property (f) tells us we can always center random variables without changing their ψα-norm up to a constant factor. Property (g) tells us all sub-Gaussian random variables are also sub-exponential random variables.

image

Since  P(|X|α ≥ u) = P(|X| ≥ u1/α) ≤ 2 exp(−u/K2), we get

image

(c) This follows from definition.

image

image

Applying Young’s inequality again we have

image

This shows  ∥XY ∥ψα ≤ 1.

(e) Without loss of generality, we can assume  ∥X∥ψα= 1. Then by property (a),

image

With a change of variable  u = tα we have

image

where Γ(·) denotes the Gamma function. Note that for s > 0,

image

where we used the fact that  xse− x2attains maximum at x = 2s because

image

Therefore

image

Using property (d) and the fact that  ∥1∥ψα =� 1log 2�1/αwe have

image

This completes the proof with  C = 1 + 4log 4 ≈ 6.77.

(g) Without loss of generality, we can assume  ∥X∥ψβ= 1. Then by property (a),

image

Therefore by property (b) we have

image

Here we give an example to demonstrate the  K2 log Kdependence for sample complexity in the JL Lemma (Lemma 5.1) is optimal for small  ε and δ. This example is virtually the same as the one in Proposition 4.5, however, this result is not implied by Proposition 4.5 as the latter does not guarantee such dependence on K is optimal when  ε is small.

Proposition B.1. Suppose random matrix  A ∈ Rm×n has symmetric i.i.d. entries  Aij suchthat  A2ij ∼ L2m · Bernoulli�L−2�. Assume L ≥ 2and denote K the positive number such that L2 = K2 log K. Also denote  e1 = (1, 0, . . . , 0)T . If for any fixed  ε, δ ∈ (0, 15), the probability bound

image

(a) √mAis an isotropic and sub-Gaussian matrix with sub-Gaussian parameter being no more than K.

image

Proof. Recall that  ∥Aij∥ψ2 ≤ K, so part (a) is straightforward to verify. Now we proceed with proof for part (b). Notice that

image

Since√8k ≤ 4√mL and

image

we get

image

Therefore

image

Here we list and prove the non-standard inequalities used in our proofs.

image

Proof. (a) From e−2x ≥ 1 − 2x we have

image

Taking exponential we get  αe−x < 2 exp(− x log 2log α ).


Designed for Accessibility and to further Open Science