Gaussian Approximation of Quantization Error for Estimation from Compressed Data

2020·Arxiv

Abstract

Abstract

I. INTRODUCTION

Due to the disproportionate size of modern datasets compared to available computing and communication resources, many inference techniques are applied to a compressed representation of the data rather than the data itself (Figure 1). In the attempt to develop and analyze inference techniques based on a degraded version of the data,

Fig. 1. Inference about the latent signal is based on degraded observations Y of the data X.

Fig. 2. The effect of a bitrate constraint is compared to the effect of additive Gaussian noise by studying the Wasserstein distance between . Under random spherical encoding, we show that this distance is bounded in the problem dimension n, hence estimating Y is equivalent to estimating it from Z.

it is conceptually appealing to model inaccuracies resulting from lossy compression as additive noise. Indeed, there exists a rich literature devoted to the characterization of this “noise”, i.e., the difference between the original data and its compressed representation [2]. Nevertheless, because of the difficulty of analyzing non-linear compression operations, this characterization is generally limited to the high-bit compression regime and other restrictions on the distribution of the data [3]–[7].

In this paper, we establish a strong and relatively simple characterization of the distribution of quantization error corresponding to a random spherical code. Specifically, we show that, in the sense of the Wasserstein distance, this error can be approximated by additive white Gaussian noise (AWGN) whose variance is inversely proportional to where R is the bitrate of the code (Figure 2). This approximation implies that the expected error of an estimator applied to the compressed representation of the data is asymptotically equivalent to the expected error of the same estimator applied to a Gaussian noise-corrupted version of the data. The benefit from such approximation is twofold: (1) inference techniques from observations corrupted by Gaussian noise can now be applied directly to the compressed representation; and (2) it provides a mechanism to characterize the performance of inference using such techniques.

A. Overview of Main Contributions

The equivalence illustrated in Figure 2 allows us to derive various novel results for two closely related inference settings, both of which are performed on a lossy compressed representation of the observed data .

• Parameter Estimation: (Section III) The data are drawn according to a distribution indexed by an unknown

• Indirect Source Coding: (Section IV) The data are distributed jointly with an unknown random (source) vector

At a high level, the main difference between these inference tasks is that the source coding problem assumes a joint distribution over the data and the quantities of interest. Beyond these settings, one may may also consider minimax and universal source coding formulation [22], [23] as well as hypotheses testing [8], [10].

In the parameter estimation setting, we consider the minimax mean-squared error (MSE)

where the infimum is over all encoding functions and decoding functions with . Zhu and Lafferty [24] provided an asymptotic expression for in the special case of the Gaussian location model where the parameter space is an n-dimensional ball. Under a similar setting, our main results yield a non-asymptotic upper bound to . Furthermore, under the additional assumption that is k-sparse, our main results implies that is upper bounded by a univariate function describing the minimax risk of soft-thresholding in sparse estimation [25], [26]. Finally, we consider the case where the data X and the parameter are described by the model , where is a random matrix with i.i.d. Gaussian entries. This setting with sparse and much larger than n was studied in the context of the compressed sensing signal acquisition frameworks [27]. By applying our main results to estimation with the approximate message passing (AMP) algorithm [28], we provide an exact asymptotic characterization of the MSE in recovering from a lossy compressed version of X obtained using bitrate-R random spherical coding. Versions of this compression and estimation problem for other type of lossy compression codes and estimators were considered in [29]–[32].

The indirect source coding setting (other names are remote or noisy source coding and rate-constrained denoising) corresponds to the case where is an ergodic process with a finite second moment. A bitrate-R spherical code is applied to while the goal is to estimate from the output Y of this code [18] [20] [19, Ch 3.5] [21]. For data normalized as , our main results imply that the MSE attained by any sequence

of Lipschtiz estimators converges to the MSE attained by these estimators when applied to , where

Specialized to the case , our result implies an interesting universality property of spherical coding followed by linear estimation: the resulting MSE equals the minimal MSE, over all encoding and estimation schemes, when a Gaussian source of the same second moment is estimated from a bitrate-R encoded version of its observations under AWGN. This fact can be seen as a direct extension of the saddle point property of the Gaussian distribution in the standard (direct) source coding setting discussed in [33]–[36].

B. Background and Related Works

Spherical codes have multiple theoretical and practical uses in numerous fields [37]. In the context of information theory, Sakrison [33] and Wyner [34] provided a geometric understanding of random spherical coding in a Gaussian setting; our main result extends their insights. Specifically, consider the representation of an n-dimensional standard Gaussian vector X using codewords uniformly distributed over the sphere of radius . The left side of Fig. 3, adapted from [33], shows a conceptual relation between X and its nearest codeword : As n increases, the angle between the two concentrates so that converges to in probability, hence the quantized representation of X and the error become orthogonal. Consequently, the MSE between X and its

Fig. 3. Conceptual 3D illustration of random spherical coding in high-dimension. The norm of the input vector X concentrates around the input sphere. Left Sphere: Geometric interpretation of standard source coding from [33], [34]. The representation sphere is chosen such that the error vector is orthogonal to the reconstruction vector . Right Sphere: Geometric description of the quantization error considered in this paper. The representation sphere is chosen such that is orthogonal to X.

quantized representation, averaged over all random codebooks, converges to the Gaussian distortion-rate function . In fact, as noted in [35], this Gaussian coding scheme1 achieves the Gaussian DRF when X is generated by any ergodic information source of unit variance, implying that the second moments of are independent of the distribution of X as the problem dimension n goes to infinity.

In this paper, we show that a much stronger statement holds for a properly scaled version of the quantized representation (Y in Fig. 3): in the limit of high dimension, the distribution of is independent of the distribution of X and is approximately Gaussian. This property of suggests that the underlying quantities of interest (e.g., the parameter vector or the sequence ) can now be estimated as if X is observed under additive Gaussian noise. This paper formalizes this intuition by showing that estimators from the Gaussian-noise corrupted version of X (Z in Fig. 3) attain similar performances if applied to the scaled representation Y .

In general, the radius of the codebook under which the distribution of Y and Z are close depends on the magnitude of X. This magnitude is only needed at the decoder, while the encoder can represent its input X using codewords living, say, on the unit sphere. In particular, such an encoder is agnostic to the relationship between X and . This situation is in contrast to optimal quantization schemes in indirect source coding [18], [19] and in problems involving estimation from compressed data [9], [10], [13], [14], [39], [40], where the specification of the model is crucial for designing the compression and estimation schemes. As a result, the random spherical coding scheme we rely on is sub-optimal in general, although it can be applied in situations where the model is unknown at the compressor. Coding schemes with similar properties were studied in the context of indirect source coding under the name compress-and-estimate in [41], [42].

The equivalence between quantization noise and AWGN we provide in this paper is given in terms of the Wasserstein distance between the distributions of these vectors. We refer to [43]–[45] for properties, applications, and the long list of alternative names of the Wasserstein distance. In the context of information theory, the Wasserstein distance has been used to establish consistency of some quantization procedures [46], [47] and to define a class of channels over which communication is possible without assuming synchronization [48], [49]. One of the core results of this paper is a novel coupling of the distributions of Y and Z given X, leading to a bound on the Wasserstein distance between them. This bound, in combination with the fact that the risk of a Lipschitz estimator is continuous with respect to the Wasserstein distance, implies that the risk of such an estimator, when used at the output of a random spherical code, converges to the risk when used at the output of a Gaussian channel.

Our work is similar in spirit to the work of Zamir and Feder [50, Section III], who provide a Gaussian approximation for the quantization error of a random dithered lattice quantizer. In the setting of their paper, the quantization error is independent of the input and distributed uniformly over the basic cell of the lattice. They show that as the dimension n increases, there exists a sequence of lattice quantizers such that the relative entropy between the distribution of the quantization noise and the isotropic Gaussian distribution with matched power is bounded from above by where c is a positive constant. Normalizing by n, they conclude that the relative entropy per dimension converges to zero in the large-n limit. To interpret their results in the setting of this paper, we can use the Gaussian transportation inequality [51], which leads to an upper bound on the 2-Wasserstein distance that is order where is the variance of the additive noise. By contrast, for quantization using a random spherical code, our results provide an upper bound on p-Wasserstein distance that is order . Namely, both bounds are proportional to but the bound following from [50] is unbounded in the problem dimension.

Another related setting is the problem of channel simulation, the goal of which is to design a random code that induces a particular target distribution between the data and the compressed representation [52]–[54].

The rest of this paper is organized as follows. In Section II we provide our main results on the distributional connection between spherical coding and AWGN. In Sections III and IV we apply these results to parameter estimation and source coding, respectively. Section V provides the proofs of the main results. Concluding remark are provided in Section VI.

II. MAIN RESULTS

The main result of this paper is a comparison between the quantization error under random spherical coding and independent Gaussian noise.

Definition 1 (Random Spherical Code). An (n, M) random spherical code is a collection of M codewords C = {C(1), . . . , C(M)} drawn independently from the uniform distribution on the unit sphere in . The encoder maps

an input vector to the index of a codeword that maximizes the cosine similarity

Given the index i and knowledge of the codebook C, the decoder outputs the compressed representation

where is a scaling parameter.

Fig. 4. Conceptual 2D description of the coupling in Theorem 1 when the representation sphere is matched to . The quantization error (Left) is compared to the standard n-dimensional Gaussian noise vector W (Right). The random variable Y is the nearest codeword to x in the random codebook ensemble. The standard Gaussian random variable A is the normalized component of W in the direction of x. The random variable B describes the magnitude of the projection of -dimensional space orthogonal to x.

Our results focus on the distribution of the compressed representation Y induced by the randomness in the codebook. Note that this distribution is parameterized by the input x and has a density with respect to the surface

measure on the sphere of radius . We also define the maximal cosine similarity according to

It is well known (see e.g., [55]) that the distribution of does not depend on x and is given by

where

and where is the Gamma function.

A. Approximation using AWGN

The fundamental question we address is the extent to which the quantization error can be approximated by an isotropic zero-mean Gaussian noise. To answer this question we introduce the AWGN-corrupted observation

model

Our main results are based on a coupling argument. Specifically, we show that there exists a joint distribution on the pair (Y, Z) under which the distribution of can be described exactly in terms of the tuple and the magnitude of the input. The proof of the following result is in Section V.

Theorem 1. For any , positive integer M, and real numbers , there exists a joint distribution on (Y, Z) such that Y has the distribution of the compressed representation of magnitude obtained from an

(n, M) random spherical code with input , and

where:

• are independent,

• 1),

• B has a chi distribution with degrees of freedom.

Figure 4 provides a conceptual illustration of , and B in the comparison between Y and Z provided in Theorem 1. The proof of this theorem in Section V provides the exact description of these random variables and vectors.

The coupling described in Theorem 1 holds for any choice of the parameters . The next step is to show that the term in (8) is negligible compared to the magnitude of the quantization error under a proper specification of these parameters. To provide a sense of scale, observe that the error due to AWGN satisfies

where concentrates about in the large-n limit. For comparison, we consider the upper bound given by

where is an estimate of and the normalized error term

does not depend on x. In the following we show that can be chosen as a function of such that the distribution of is bounded independently of the dimension n.

First we consider the setting where the number of codewords is given by for a fixed bitrate R > 0.

For a given we use the specification

Theorem 2. Suppose that for a fixed bitrate R > 0. Under the specification given in (11), the normalized error term defined in (10) has a sub-Gaussian distribution with parameters that depend only on the

bitrate R. In particular, there exists a positive number such that

The significance of Theorem 2 is that the distribution of is bounded uniformly with respect to n and thus the term in (9) is order one. By comparison, the magnitude of the additive noise scales at rate . This means that if the estimate of is accurate in the sense that as , then the relative difference between the mismatch and the quantization error converges to zero.

Next, we provide a result that holds in the high-rate setting where diverges. This regime requires a more

precise estimate of the max-cosine similarity and we use the specification

Note that is normalized by instead of n. Further, we will require that the number of codewords is bounded

from below by

for some fixed constant .

Theorem 3. Suppose that for some constant . Under the specification given in (13), the normalized error term defined in (10) has a sub-exponential distribution with parameters that depend only on

. In particular, there exists a positive number such that

Theorem 3 is stronger than Theorem 2 in the sense that the bound holds uniformly for all (n, M) satisfying the constraint . This is important for the high-bitrate setting where converges to zero. The price that is paid is that the sub-Gaussian tail condition is replaced with the weaker sub-exponential condition. An explicit value for the constant and its dependence on can be found in the proof of Theorem 3, which is given in Appendix A.

B. Bounds on Wasserstein Distance

Our results can also be stated in terms of the Wasserstein distance on distributions. The p-Wasserstein distance

between distributions P and Q on is defined by

where the infimum is over all joint distributions on (U, V ) satisfying the marginal constraints and . For , the p-Wasserstein distance is a metric on the space of distributions with finite p-th moments.

Theorems 2 and 3 imply upper bounds on the Wasserstein distance between the distribution of the compressed representation obtained using a random spherical code and the distribution of the AWGN-corrupted version of the input.

Theorem 4. Let X be a random vector in with for some . Let be the distribution of the compressed representation of magnitude obtained from an (n, M) random spherical code with input X and let be the distribution of where W is an independent standard Gaussian vector.

(ii) If for some and are defined as in (13), then

Proof. The p-th power of the Wasserstein distance is convex in the pair (P, Q) [45, Theorem 4.8], and thus

where and denote the conditional distributions of Y and Z, respectively. In view of (9) and (12),

it follows that

Combining these displays with Minkowski’s inequality leads to (16). Inequality (17) is obtained in a similar manner from (14).

A useful property of the Wasserstein distance is that it controls the expectations of Lipschitz continuous functions.

Recall that a mapping is Lipschitz continuous if there exists a constant L such that

The infimum over all L is call the Lipschitz constant and is denoted by by . The 1-Wasserstein distance,

which is also known as the Kantorovich–Rubinstein distance, can be expressed equivalently as

where and . More generally, the p-Wasserstein can be used to bound the difference between p-th moments.

Proposition 5. Let and be random vectors on . For any Lipschitz function ,

provided that the expectations exist.

Proof. For any coupling of (U, V ), Minkowski’s inequality and the Lipschitz assumption on f yield

Taking the infimum over all possible couplings leads to one side of the inequality. Interchanging the role of U and V and repeating the same steps gives the other side.

C. Concentration of the Norm

The bounds in Theorems 2 and 3 simplify further when the parameter in (10) is matched to the magnitude of the input. As a specific example, suppose that the data is known to lie on the sphere of radius and that . By setting , we see that there exists a coupling of Y and Z under which the quantization error is bounded by a sub-Gaussian random variables that is independent of n.

For many applications, however, the assumption that the data lay on a sphere of known radius is too restrictive. Therefore, in this paper, we assume that the data at the input to the compressor is a random vector X in whose magnitude concentrates about a known value . This assumption is reasonable for high-dimensional settings where the entries of X are weakly correlated. More generally, there are a number of other approaches that can be used to deal with the fact that is unknown. One approach is to use additional bits to encode the magnitude of X, as is done in [24]. For example, if almost surely where is a known constant, then bits

are sufficient to encode with absolute error less than , such that

When n is large, the logarithmic number of bits used to encode the magnitude of X is negligible compared to the nR bits used to encode its direction. An alternative approach is to compare the compressed representation with a noisy version of X after it has been projected onto the unit sphere in . This can be achieved, by setting and redefining the input to be such that the magnitude is equal to one almost surely. In both of the approaches described above, the noise variance is scaled in such a way that the signal-to-noise ratio in the AWGN observation model (7) depends only on (n, R) and is given by . One may also consider a variable-length coding strategy that adapts the number of bits to the magnitude of X such that the effective noise power is constant and the signal-to-noise ratio is proportional to . We leave this as a direction for future work.

III. APPLICATION TO PARAMETER ESTIMATION

In this section, we apply our main results to the problem of estimating an unknown parameter vector from a compressed representation of the data X. For each integer n, let be a family of distributions on with index set . For the purposes of exposition we will focus on the squared error loss. Our approach is quite general, however, and can be extended to other loss functions.

An important performance benchmark in estimating from a bitrate-R compressed representation of X is the minimax MSE:

where the minimum is over all encoding functions and decoding functions with .

Zhu and Lafferty [24] studied the asymptotic minimax MSE for the Gaussian location model

with the n-dimensional Euclidean ball of radius , and showed that

Their achievability result is based on random spherical coding while devoting a number of bits sublinear in n to encode the magnitude of the X, as discussed in Section II-C above.

The comparison between quantization error and Gaussian noise in Theorem 4 provides a straightforward method for obtaining non-asymptotic upper bounds on the minimax MSE that can be applied to a large class of models. The basic idea is to study the MSE of Lipschitz estimators applied to the AWGN-corrupted data. We use the following assumption, which says that concentrates on a spherical shell whose radius does not depend on .

Assumption 1 (Concentration of Magnitude). There exists a sequence of positive numbers such that

Assumption 1 provides a way to formulate many cases of interest in terms of the radius of the shell and its width .

The next result uses this assumption to bound the difference in root MSE between an estimator applied to the compressed representation Y and the same estimator applied to the AWGN-corrupted version Z.

Theorem 6. Let be a sequence of models that satisfies Assumption 1. Given , let Y be the output of an random spherical code with input X and output scaling and let where is independent of X and is according to (11). For any Lipschitz estimator , the

(26) where C is a constant that depends only on the bitrate R and . Furthermore, for all t > 0, the

minimax MSE satisfies

where we have used the fact that f is the composition of with the 1-Lipschitz function , and thus . By Theorem 4 and Assumption 1, the Wasserstein distance is upper bounded by ,

where is given in (11). It follows that

where only depends on R. The upper bound on the minimax MSE follows from the inequality

One takeaway from Theorem 6 is that the MSE obtained from the compressed representation is asymptotically equivalent to that of the AWGN-corrupted observation, provided that the Lipschitz constant of the estimator is small enough. To gain insight into the interplay between the Lipschitz constant of the estimator, the magnitude of the data, and the typical size of the squared error, it is useful to consider some concrete examples.

A. Gaussian Location Model

For our first example, we consider the Gaussian location model . Assume that the parameter set is a subset of the spherical shell:

As an intuition for this notation, one may think about as an estimate for the magnitude of and as the uncertainty in this estimate. For example, if the entries of are sampled independently from a sub-Gaussian distribution with a second moment , then is sub-Gaussian [56, Thm 3.1.1]. In this case, there exists a constant C independent of n such that for with probability at least .

The next statement says that under a Gaussian location model, X concentrates whenever is restricted.

Proposition 7. Consider the model with . Assumption 1 is satisfied with and .

The assumption implies

Consequently, the second term on the right-hand side of (29) is upper bounded by . For the first term, we write

where we have used the fact that has a non-central chi-squared distribution with n degrees of freedom

and non-centrality parameter

In this setting, the AWGN-corrupted data Z, corresponding to a bitrate R and magnitude , is drawn according to the Gaussian location model whose noise variance depends on the original noise level and the bitrate R:

The MSE in the Gaussian location model has been studied extensively. If we restrict our attention to linear estimators

of the form then a standard calculation (see e.g., [26, Ch. 4.8]) gives

where the minimum over is attained at . By expressing the right-hand side as a function of R and combining with Theorem 6, we obtain a non-asymptotic upper bound on the minimax MSE.

Proposition 8. Consider the model with . Let Y be the output of a bitrate-R random

spherical code applied to X and scaled to the radius. Then

where C is a constant that depends on but not n.

Proof. We have , and for the linear estimator . Following Proposition 7, Assumption 1 is satisfied with . We use Theorem 6 with and .

It follows that there exists a constant c such that

Grouping factors leads to (32).

Since

Proposition 8 recovers parts of the results in [24] by showing that there exists a bitrate-R coding scheme with minimax risk approaching (24). Furthermore, Proposition 8 shows that the minimax risk under such scheme converges at rate , which is faster than the convergence rate established in [24] by a factor of .

More generally, we can also provide bounds for non-linear estimators. The case of a k-sparse parameter vector

can be modeled as

where denotes the number of nonzero entries in . A great deal of work has studied the MSE of the soft-

thresholding estimator

in the model (30) [25], [26]. Specifically, we have

where, for ,

where and are the cumulative and density functions of the standard Gaussian distribution, respectively. Let be the minimizer in (34).

Proposition 9. Let where . Let Y be the output of a bitrate-R

random spherical code applied to X and scaled to the radius. Then

where C is a constant that depends on but neither k or n.

Proof. For any we have . Equation (36) follows from Theorem 6 by using and grouping factors.

Corollary 10. Assume that . The bitrate-R constrained minimax risk over , with , satisfies

B. Linear Model with IID Matrix

For the next example, we consider the linear model

where A is a known matrix, is an unknown d-dimensional vector, and a known noise variance. In this setting, the AWGN-corrupted version of X given by with corresponds to a linear

model with larger noise variance, that is

We study the approximate message passing (AMP) algorithm [28] to estimate from Z. AMP is an iterative algorithm that can be defined by a sequence of scalar denoising functions with that are assumed to be Lipschitz continuous, and hence differentiable almost everywhere. Starting with an initial points

and , a sequence of estimates is generated according to

where is applied comopontwise and with .

The main result of [28], [57] says that the MSE of each iteration of AMP can be characterized precisely in the high-dimensional limit when A is a realization of a random matrix with i.i.d. zero-mean Gaussian entries. To formally state and use this result, we need to the following assumptions:

Assumption 2. is a sequence of -dimensional vectors such that as n goes to infinity. The empirical distributions of , i.e., the probability distribution that puts a point mass at each of the entries of , converges weakly to a distribution on R with finite second moment . Furthermore, converges to as .

Assumption 3. is a sequence of models defined by , where the entries of A are i.i.d. N(0, 1/n).

For a fixed n, we further consider a sequence of estimators for defined as follows:

Assumption 4. is a sequence of scalar, Lipschitz continuous, and differentiable denoisers . For every , the approximate message-passing (AMP) estimator is defined as the results of t iterations of (39) and (40).

The characterization of the MSE of the estimator in the high-dimensional limit is given by the state evolution recursion. This recursion is defined in terms of a distribution on R, sampling ratio , and initial noise

level , as

where and . Finally, define

where is given by t iterations of (41). Under assumptions 2-4 above, [57, Thm. 1] implies that

Combining this result with Theorem 6, we conclude the following:

Theorem 11. Consider a sequence of problems satisfying Assumptions 2 and 3. Let Y be the output of an random spherical code applied to X with radius for some R > 0. Let be an estimator satisf