Due to the disproportionate size of modern datasets compared to available computing and communication resources, many inference techniques are applied to a compressed representation of the data rather than the data itself (Figure 1). In the attempt to develop and analyze inference techniques based on a degraded version of the data,
Fig. 1. Inference about the latent signal is based on degraded observations Y of the data X.
Fig. 2. The effect of a bitrate constraint is compared to the effect of additive Gaussian noise by studying the Wasserstein distance between . Under random spherical encoding, we show that this distance is bounded in the problem dimension n, hence estimating
Y is equivalent to estimating it from Z.
it is conceptually appealing to model inaccuracies resulting from lossy compression as additive noise. Indeed, there exists a rich literature devoted to the characterization of this “noise”, i.e., the difference between the original data and its compressed representation [2]. Nevertheless, because of the difficulty of analyzing non-linear compression operations, this characterization is generally limited to the high-bit compression regime and other restrictions on the distribution of the data [3]–[7].
In this paper, we establish a strong and relatively simple characterization of the distribution of quantization error corresponding to a random spherical code. Specifically, we show that, in the sense of the Wasserstein distance, this error can be approximated by additive white Gaussian noise (AWGN) whose variance is inversely proportional to
where R is the bitrate of the code (Figure 2). This approximation implies that the expected error of an estimator applied to the compressed representation of the data is asymptotically equivalent to the expected error of the same estimator applied to a Gaussian noise-corrupted version of the data. The benefit from such approximation is twofold: (1) inference techniques from observations corrupted by Gaussian noise can now be applied directly to the compressed representation; and (2) it provides a mechanism to characterize the performance of inference using such techniques.
A. Overview of Main Contributions
The equivalence illustrated in Figure 2 allows us to derive various novel results for two closely related inference settings, both of which are performed on a lossy compressed representation of the observed data .
• Parameter Estimation: (Section III) The data are drawn according to a distribution indexed by an unknown
• Indirect Source Coding: (Section IV) The data are distributed jointly with an unknown random (source) vector
At a high level, the main difference between these inference tasks is that the source coding problem assumes a joint distribution over the data and the quantities of interest. Beyond these settings, one may may also consider minimax and universal source coding formulation [22], [23] as well as hypotheses testing [8], [10].
In the parameter estimation setting, we consider the minimax mean-squared error (MSE)
where the infimum is over all encoding functions and decoding functions
with
. Zhu and Lafferty [24] provided an asymptotic expression for
in the special case of the Gaussian location model
where the parameter space
is an n-dimensional ball. Under a similar setting, our main results yield a non-asymptotic upper bound to
. Furthermore, under the additional assumption that
is k-sparse, our main results implies that
is upper bounded by a univariate function describing the minimax risk of soft-thresholding in sparse estimation [25], [26]. Finally, we consider the case where the data X and the parameter
are described by the model
, where
is a random matrix with i.i.d. Gaussian entries. This setting with
sparse and
much larger than n was studied in the context of the compressed sensing signal acquisition frameworks [27]. By applying our main results to estimation with the approximate message passing (AMP) algorithm [28], we provide an exact asymptotic characterization of the MSE in recovering
from a lossy compressed version of X obtained using bitrate-R random spherical coding. Versions of this compression and estimation problem for other type of lossy compression codes and estimators were considered in [29]–[32].
The indirect source coding setting (other names are remote or noisy source coding and rate-constrained denoising) corresponds to the case where is an ergodic process with a finite second moment. A bitrate-R spherical code is applied to
while the goal is to estimate
from the output Y of this code [18] [20] [19, Ch 3.5] [21]. For data normalized as
, our main results imply that the MSE attained by any sequence
of Lipschtiz estimators converges to the MSE attained by these estimators when applied to , where
Specialized to the case , our result implies an interesting universality property of spherical coding followed by linear estimation: the resulting MSE equals the minimal MSE, over all encoding and estimation schemes, when a Gaussian source of the same second moment is estimated from a bitrate-R encoded version of its observations under AWGN. This fact can be seen as a direct extension of the saddle point property of the Gaussian distribution in the standard (direct) source coding setting discussed in [33]–[36].
B. Background and Related Works
Spherical codes have multiple theoretical and practical uses in numerous fields [37]. In the context of information theory, Sakrison [33] and Wyner [34] provided a geometric understanding of random spherical coding in a Gaussian setting; our main result extends their insights. Specifically, consider the representation of an n-dimensional standard Gaussian vector X using codewords uniformly distributed over the sphere of radius
. The left side of Fig. 3, adapted from [33], shows a conceptual relation between X and its nearest codeword
: As n increases, the angle
between the two concentrates so that
converges to
in probability, hence the quantized representation of X and the error
become orthogonal. Consequently, the MSE between X and its
Fig. 3. Conceptual 3D illustration of random spherical coding in high-dimension. The norm of the input vector X concentrates around the input sphere. Left Sphere: Geometric interpretation of standard source coding from [33], [34]. The representation sphere is chosen such that the error vector is orthogonal to the reconstruction vector
. Right Sphere: Geometric description of the quantization error considered in this paper. The representation sphere is chosen such that
is orthogonal to X.
quantized representation, averaged over all random codebooks, converges to the Gaussian distortion-rate function . In fact, as noted in [35], this Gaussian coding scheme1 achieves the Gaussian DRF when X is generated by any ergodic information source of unit variance, implying that the second moments of
are independent of the distribution of X as the problem dimension n goes to infinity.
In this paper, we show that a much stronger statement holds for a properly scaled version of the quantized representation (Y in Fig. 3): in the limit of high dimension, the distribution of is independent of the distribution of X and is approximately Gaussian. This property of
suggests that the underlying quantities of interest (e.g., the parameter vector
or the sequence
) can now be estimated as if X is observed under additive Gaussian noise. This paper formalizes this intuition by showing that estimators from the Gaussian-noise corrupted version of X (Z in Fig. 3) attain similar performances if applied to the scaled representation Y .
In general, the radius of the codebook under which the distribution of Y and Z are close depends on the magnitude of X. This magnitude is only needed at the decoder, while the encoder can represent its input X using codewords living, say, on the unit sphere. In particular, such an encoder is agnostic to the relationship between X and . This situation is in contrast to optimal quantization schemes in indirect source coding [18], [19] and in problems involving estimation from compressed data [9], [10], [13], [14], [39], [40], where the specification of the model
is crucial for designing the compression and estimation schemes. As a result, the random spherical coding scheme we rely on is sub-optimal in general, although it can be applied in situations where the model
is unknown at the compressor. Coding schemes with similar properties were studied in the context of indirect source coding under the name compress-and-estimate in [41], [42].
The equivalence between quantization noise and AWGN we provide in this paper is given in terms of the Wasserstein distance between the distributions of these vectors. We refer to [43]–[45] for properties, applications, and the long list of alternative names of the Wasserstein distance. In the context of information theory, the Wasserstein distance has been used to establish consistency of some quantization procedures [46], [47] and to define a class of channels over which communication is possible without assuming synchronization [48], [49]. One of the core results of this paper is a novel coupling of the distributions of Y and Z given X, leading to a bound on the Wasserstein distance between them. This bound, in combination with the fact that the risk of a Lipschitz estimator is continuous with respect to the Wasserstein distance, implies that the risk of such an estimator, when used at the output of a random spherical code, converges to the risk when used at the output of a Gaussian channel.
Our work is similar in spirit to the work of Zamir and Feder [50, Section III], who provide a Gaussian approximation for the quantization error of a random dithered lattice quantizer. In the setting of their paper, the quantization error is independent of the input and distributed uniformly over the basic cell of the lattice. They show that as the dimension n increases, there exists a sequence of lattice quantizers such that the relative entropy between the distribution of the quantization noise and the isotropic Gaussian distribution with matched power is bounded from above by where c is a positive constant. Normalizing by n, they conclude that the relative entropy per dimension converges to zero in the large-n limit. To interpret their results in the setting of this paper, we can use the Gaussian transportation inequality [51], which leads to an upper bound on the 2-Wasserstein distance that is order
where
is the variance of the additive noise. By contrast, for quantization using a random spherical code, our results provide an upper bound on p-Wasserstein distance that is order
. Namely, both bounds are proportional to
but the bound following from [50] is unbounded in the problem dimension.
Another related setting is the problem of channel simulation, the goal of which is to design a random code that induces a particular target distribution between the data and the compressed representation [52]–[54].
The rest of this paper is organized as follows. In Section II we provide our main results on the distributional connection between spherical coding and AWGN. In Sections III and IV we apply these results to parameter estimation and source coding, respectively. Section V provides the proofs of the main results. Concluding remark are provided in Section VI.
The main result of this paper is a comparison between the quantization error under random spherical coding and independent Gaussian noise.
Definition 1 (Random Spherical Code). An (n, M) random spherical code is a collection of M codewords C = {C(1), . . . , C(M)} drawn independently from the uniform distribution on the unit sphere in . The encoder maps
an input vector to the index
of a codeword that maximizes the cosine similarity
Given the index i and knowledge of the codebook C, the decoder outputs the compressed representation
where is a scaling parameter.
Fig. 4. Conceptual 2D description of the coupling in Theorem 1 when the representation sphere is matched to . The quantization error
(Left) is compared to the standard n-dimensional Gaussian noise vector W (Right). The random variable Y is the nearest codeword to x in the random codebook ensemble. The standard Gaussian random variable A is the normalized component of W in the direction of x. The random variable B describes the magnitude of the projection of
-dimensional space orthogonal to x.
Our results focus on the distribution of the compressed representation Y induced by the randomness in the codebook. Note that this distribution is parameterized by the input x and has a density with respect to the surface
measure on the sphere of radius . We also define the maximal cosine similarity according to
It is well known (see e.g., [55]) that the distribution of does not depend on x and is given by
where
and where is the Gamma function.
A. Approximation using AWGN
The fundamental question we address is the extent to which the quantization error can be approximated by an isotropic zero-mean Gaussian noise. To answer this question we introduce the AWGN-corrupted observation
model
Our main results are based on a coupling argument. Specifically, we show that there exists a joint distribution on the pair (Y, Z) under which the distribution of can be described exactly in terms of the tuple
and the magnitude of the input. The proof of the following result is in Section V.
Theorem 1. For any , positive integer M, and real numbers
, there exists a joint distribution on (Y, Z) such that Y has the distribution of the compressed representation of magnitude
obtained from an
(n, M) random spherical code with input , and
where:
• are independent,
• 1),
• B has a chi distribution with degrees of freedom.
Figure 4 provides a conceptual illustration of , and B in the comparison between Y and Z provided in Theorem 1. The proof of this theorem in Section V provides the exact description of these random variables and vectors.
The coupling described in Theorem 1 holds for any choice of the parameters . The next step is to show that the term
in (8) is negligible compared to the magnitude of the quantization error under a proper specification of these parameters. To provide a sense of scale, observe that the error due to AWGN satisfies
where concentrates about
in the large-n limit. For comparison, we consider the upper bound given by
where is an estimate of
and the normalized error term
does not depend on x. In the following we show that can be chosen as a function of
such that the distribution of
is bounded independently of the dimension n.
First we consider the setting where the number of codewords is given by for a fixed bitrate R > 0.
For a given we use the specification
Theorem 2. Suppose that for a fixed bitrate R > 0. Under the specification given in (11), the normalized error term
defined in (10) has a sub-Gaussian distribution with parameters that depend only on the
bitrate R. In particular, there exists a positive number such that
The significance of Theorem 2 is that the distribution of is bounded uniformly with respect to n and thus the term
in (9) is order one. By comparison, the magnitude of the additive noise
scales at rate
. This means that if the estimate of
is accurate in the sense that
as
, then the relative difference between the mismatch
and the quantization error
converges to zero.
Next, we provide a result that holds in the high-rate setting where diverges. This regime requires a more
precise estimate of the max-cosine similarity and we use the specification
Note that is normalized by
instead of n. Further, we will require that the number of codewords is bounded
from below by
for some fixed constant .
Theorem 3. Suppose that for some constant
. Under the specification given in (13), the normalized error term
defined in (10) has a sub-exponential distribution with parameters that depend only on
. In particular, there exists a positive number
such that
Theorem 3 is stronger than Theorem 2 in the sense that the bound holds uniformly for all (n, M) satisfying the constraint . This is important for the high-bitrate setting where
converges to zero. The price that is paid is that the sub-Gaussian tail condition is replaced with the weaker sub-exponential condition. An explicit value for the constant
and its dependence on
can be found in the proof of Theorem 3, which is given in Appendix A.
B. Bounds on Wasserstein Distance
Our results can also be stated in terms of the Wasserstein distance on distributions. The p-Wasserstein distance
between distributions P and Q on is defined by
where the infimum is over all joint distributions on (U, V ) satisfying the marginal constraints and
. For
, the p-Wasserstein distance is a metric on the space of distributions with finite p-th moments.
Theorems 2 and 3 imply upper bounds on the Wasserstein distance between the distribution of the compressed representation obtained using a random spherical code and the distribution of the AWGN-corrupted version of the input.
Theorem 4. Let X be a random vector in with
for some
. Let
be the distribution of the compressed representation of magnitude
obtained from an (n, M) random spherical code with input X and let
be the distribution of
where W is an independent standard Gaussian vector.
(ii) If for some
and
are defined as in (13), then
Proof. The p-th power of the Wasserstein distance is convex in the pair (P, Q) [45, Theorem 4.8], and thus
where and
denote the conditional distributions of Y and Z, respectively. In view of (9) and (12),
it follows that
Combining these displays with Minkowski’s inequality leads to (16). Inequality (17) is obtained in a similar manner from (14).
A useful property of the Wasserstein distance is that it controls the expectations of Lipschitz continuous functions.
Recall that a mapping is Lipschitz continuous if there exists a constant L such that
The infimum over all L is call the Lipschitz constant and is denoted by by . The 1-Wasserstein distance,
which is also known as the Kantorovich–Rubinstein distance, can be expressed equivalently as
where and
. More generally, the p-Wasserstein can be used to bound the difference between p-th moments.
Proposition 5. Let and
be random vectors on
. For any Lipschitz function
,
provided that the expectations exist.
Proof. For any coupling of (U, V ), Minkowski’s inequality and the Lipschitz assumption on f yield
Taking the infimum over all possible couplings leads to one side of the inequality. Interchanging the role of U and V and repeating the same steps gives the other side.
C. Concentration of the Norm
The bounds in Theorems 2 and 3 simplify further when the parameter in (10) is matched to the magnitude of the input. As a specific example, suppose that the data is known to lie on the sphere of radius
and that
. By setting
, we see that there exists a coupling of Y and Z under which the quantization error
is bounded by a sub-Gaussian random variables that is independent of n.
For many applications, however, the assumption that the data lay on a sphere of known radius is too restrictive. Therefore, in this paper, we assume that the data at the input to the compressor is a random vector X in whose magnitude
concentrates about a known value
. This assumption is reasonable for high-dimensional settings where the entries of X are weakly correlated. More generally, there are a number of other approaches that can be used to deal with the fact that
is unknown. One approach is to use additional bits to encode the magnitude of X, as is done in [24]. For example, if
almost surely where
is a known constant, then
bits
are sufficient to encode with absolute error less than
, such that
When n is large, the logarithmic number of bits used to encode the magnitude of X is negligible compared to the nR bits used to encode its direction. An alternative approach is to compare the compressed representation with a noisy version of X after it has been projected onto the unit sphere in . This can be achieved, by setting
and redefining the input to be
such that the magnitude is equal to one almost surely. In both of the approaches described above, the noise variance
is scaled in such a way that the signal-to-noise ratio in the AWGN observation model (7) depends only on (n, R) and is given by
. One may also consider a variable-length coding strategy that adapts the number of bits to the magnitude of X such that the effective noise power is constant and the signal-to-noise ratio is proportional to
. We leave this as a direction for future work.
In this section, we apply our main results to the problem of estimating an unknown parameter vector from a compressed representation of the data X. For each integer n, let
be a family of distributions on
with index set
. For the purposes of exposition we will focus on the squared error loss. Our approach is quite general, however, and can be extended to other loss functions.
An important performance benchmark in estimating from a bitrate-R compressed representation of X is the minimax MSE:
where the minimum is over all encoding functions and decoding functions
with
.
Zhu and Lafferty [24] studied the asymptotic minimax MSE for the Gaussian location model
with the n-dimensional Euclidean ball of radius
, and showed that
Their achievability result is based on random spherical coding while devoting a number of bits sublinear in n to encode the magnitude of the X, as discussed in Section II-C above.
The comparison between quantization error and Gaussian noise in Theorem 4 provides a straightforward method for obtaining non-asymptotic upper bounds on the minimax MSE that can be applied to a large class of models. The basic idea is to study the MSE of Lipschitz estimators applied to the AWGN-corrupted data. We use the following assumption, which says that concentrates on a spherical shell whose radius does not depend on
.
Assumption 1 (Concentration of Magnitude). There exists a sequence of positive numbers such that
Assumption 1 provides a way to formulate many cases of interest in terms of the radius of the shell and its width
.
The next result uses this assumption to bound the difference in root MSE between an estimator applied to the compressed representation Y and the same estimator applied to the AWGN-corrupted version Z.
Theorem 6. Let be a sequence of models that satisfies Assumption 1. Given
, let Y be the output of an
random spherical code with input X and output scaling
and let
where
is independent of X and
is according to (11). For any Lipschitz estimator
, the
(26) where C is a constant that depends only on the bitrate R and
. Furthermore, for all t > 0, the
minimax MSE satisfies
where we have used the fact that f is the composition of with the 1-Lipschitz function
, and thus
. By Theorem 4 and Assumption 1, the Wasserstein distance is upper bounded by
,
where is given in (11). It follows that
where only depends on R. The upper bound on the minimax MSE follows from the inequality
One takeaway from Theorem 6 is that the MSE obtained from the compressed representation is asymptotically equivalent to that of the AWGN-corrupted observation, provided that the Lipschitz constant of the estimator is small enough. To gain insight into the interplay between the Lipschitz constant of the estimator, the magnitude of the data, and the typical size of the squared error, it is useful to consider some concrete examples.
A. Gaussian Location Model
For our first example, we consider the Gaussian location model . Assume that the parameter set
is a subset of the spherical shell:
As an intuition for this notation, one may think about as an estimate for the magnitude of
and
as the uncertainty in this estimate. For example, if the entries of
are sampled independently from a sub-Gaussian distribution with a second moment
, then
is sub-Gaussian [56, Thm 3.1.1]. In this case, there exists a constant C independent of n such that
for
with probability at least
.
The next statement says that under a Gaussian location model, X concentrates whenever is restricted.
Proposition 7. Consider the model with
. Assumption 1 is satisfied with
and
.
The assumption implies
Consequently, the second term on the right-hand side of (29) is upper bounded by . For the first term, we write
where we have used the fact that has a non-central chi-squared distribution with n degrees of freedom
and non-centrality parameter
In this setting, the AWGN-corrupted data Z, corresponding to a bitrate R and magnitude , is drawn according to the Gaussian location model whose noise variance depends on the original noise level
and the bitrate R:
The MSE in the Gaussian location model has been studied extensively. If we restrict our attention to linear estimators
of the form then a standard calculation (see e.g., [26, Ch. 4.8]) gives
where the minimum over is attained at
. By expressing the right-hand side as a function of R and combining with Theorem 6, we obtain a non-asymptotic upper bound on the minimax MSE.
Proposition 8. Consider the model with
. Let Y be the output of a bitrate-R random
spherical code applied to X and scaled to the radius. Then
where C is a constant that depends on but not n.
Proof. We have , and
for the linear estimator
. Following Proposition 7, Assumption 1 is satisfied with
. We use Theorem 6 with
and
.
It follows that there exists a constant c such that
Grouping factors leads to (32).
Since
Proposition 8 recovers parts of the results in [24] by showing that there exists a bitrate-R coding scheme with minimax risk approaching (24). Furthermore, Proposition 8 shows that the minimax risk under such scheme converges at rate , which is faster than the convergence rate established in [24] by a factor of
.
More generally, we can also provide bounds for non-linear estimators. The case of a k-sparse parameter vector
can be modeled as
where denotes the number of nonzero entries in
. A great deal of work has studied the MSE of the soft-
thresholding estimator
in the model (30) [25], [26]. Specifically, we have
where, for ,
where and
are the cumulative and density functions of the standard Gaussian distribution, respectively. Let
be the minimizer in (34).
Proposition 9. Let where
. Let Y be the output of a bitrate-R
random spherical code applied to X and scaled to the radius. Then
where C is a constant that depends on but neither k or n.
Proof. For any we have
. Equation (36) follows from Theorem 6 by using
and grouping
factors.
Corollary 10. Assume that . The bitrate-R constrained minimax risk over
, with
, satisfies
B. Linear Model with IID Matrix
For the next example, we consider the linear model
where A is a known matrix,
is an unknown d-dimensional vector, and
a known noise variance. In this setting, the AWGN-corrupted version of X given by
with
corresponds to a linear
model with larger noise variance, that is
We study the approximate message passing (AMP) algorithm [28] to estimate from Z. AMP is an iterative algorithm that can be defined by a sequence of scalar denoising functions
with
that are assumed to be Lipschitz continuous, and hence differentiable almost everywhere. Starting with an initial points
and , a sequence of estimates
is generated according to
where is applied comopontwise and
with
.
The main result of [28], [57] says that the MSE of each iteration of AMP can be characterized precisely in the high-dimensional limit when A is a realization of a random matrix with i.i.d. zero-mean Gaussian entries. To formally state and use this result, we need to the following assumptions:
Assumption 2. is a sequence of
-dimensional vectors such that
as n goes to infinity. The empirical distributions of
, i.e., the probability distribution that puts a point mass
at each of the
entries of
, converges weakly to a distribution
on R with finite second moment
. Furthermore,
converges to
as
.
Assumption 3. is a sequence of models defined by
, where the entries of A are i.i.d. N(0, 1/n).
For a fixed n, we further consider a sequence of estimators for defined as follows:
Assumption 4. is a sequence of scalar, Lipschitz continuous, and differentiable denoisers
. For every
, the approximate message-passing (AMP) estimator
is defined as the results of t iterations of (39) and (40).
The characterization of the MSE of the estimator in the high-dimensional limit is given by the state evolution recursion. This recursion is defined in terms of a distribution
on R, sampling ratio
, and initial noise
level , as
where and
. Finally, define
where is given by t iterations of (41). Under assumptions 2-4 above, [57, Thm. 1] implies that
Combining this result with Theorem 6, we conclude the following:
Theorem 11. Consider a sequence of problems satisfying Assumptions 2 and 3. Let Y be the output of an random spherical code applied to X with radius
for some R > 0. Let
be an estimator satisfying Assumption 4.
Then
almost surely, where
Proof. Set and
. We first show that X and
satisfy Assumption 1. Since A has i.i.d. entries N(0, 1/n), then
. Using similar arguments as in Proposition 7,
we get
Assumption 2 implies that with
. We conclude that
and thus Assumption 1 is satisfied for some . Let
. In Appendix E we show that
Using the triangle inequality once with the last display, Theorem 6 implies that there exists C, that depends only
with .
For the second application, we consider an indirect source coding setting where the observed data is a degraded version of the realization of an information source. The goal is to compress this version at bitrate R and recover the source realization. Traditionally, both the encoder and decoder are designed with full knowledge of the joint distribution of the source and the data [19]. In this section, we study an encoding-decoding scheme where the encoder uses a random spherical code and the decoder is described by a Lipschitz estimator, which may be designed with partial or full knowledge of the distribution of the source and the data. Leveraging the results in Section II, we show that the asymptotic performance can be described in terms of an AWGN model.
Throughout this section, the source and the data are modeled as a stochastic process . The first n terms in this sequence are denoted by
and
. We focus on the squared
error loss (or distortion function)
and assume the following regularity condition:
Assumption 5. The process is stationary and second-order ergodic with finite second moments. In particular, this means that the empirical second moments converge in mean:
A. The Indirect Distortion-Rate Function
We begin by reviewing some basic properties of the indirect distortion-rate function, which describes the fundamental tradeoff between the bitrate R and the expected distortion in our source coding setting. For each
problem of size n, the indirect distortion-rate function is given by
where the minimum is over all encoding functions and decoding functions
with
. The standard source coding setting corresponds to the special case where the source equals the data. When the source and data are stationary, as we assume in this paper,
is sub-additive in n, and
the limit
is well-defined [58, Lem. 10.6.2].
For some classes of processes, D(R) can be expressed equivalently in terms of an optimization problem over a
family of probability distributions subject to a mutual information constraint [18], [19]. Specifically, we have
where the minimum is over all joint distributions on such that
satisfy their marginal constraints,
forms a Markov chain, and
. For example, a representation of the form (48) exists for memoryless processes [19], [59] and in cases where the direct (standard) distortion-rate function of the sequence of random vectors
has a representation of the form (48) by setting
[60, Ch. 3.2].
There are a few cases where the distortion-rate function has simple closed-form expressions. For example if are i.i.d. from bivariate Gaussian distribution with zero mean, then the distortion-rate function is given
by where
This characterization was obtained in [18] and also [20]. Note that the limiting case corresponds to the minimum MSE in estimating
from
. Moreover, for the direct source coding problem where
is equal to
, this expression reduces to the standard distortion-rate function for an i.i.d. Gaussian source,
.
B. Achievability using Spherical Coding
We now consider the distortion that can be achieved when is compressed using a random spherical code. For each problem of size n, let
be the output of a bitrate-R random spherical code with input
and squared magnitude
. The distortion-rate function associated with random spherical coding and estimator
is defined as
where the expectation is with respect to the joint distribution of . Under the squared error distortion, the minimum with respect to f is achieved by the conditional expectation
. We note that this formulation of the distortion-rate function does not necessarily describe the optimal performance that is possible using a random spherical code, because the estimation stage is based only on the compressed representation
and does not use any other information about the realization of the codebook.
Following the central theme of this paper, our results are described in terms of an AWGN counterpart to the
distortion-rate function. Given noise variance , define the sequence
by
where is an independent standard Gaussian noise. The MSE associated with an estimator
is
defined by
The minimum over f is attained by the conditional expectation and is denoted by
. Stationarity of the sequence
implies that
is sub-additive
in n, and thus the following limit is well-defined
We refer to as the minimum MSE function associated with the AWGN model. The next result establishes the formal equivalence between the distortion-rate function associated with random spherical coding and
. The proof is based on the Gaussian approximation of quantization error in Theorem 4 as well as some further properties of the AWGN model.
Theorem 12. Suppose that is a random process satisfying Assumption 5. Let
be a sequence of estimators
satisfying
and
for all n where L, C are positive
constants. Then, for each R > 0,
where . Furthermore, there exists a sequence of estimators
such that
By Theorem 4, the normalized Wasserstein distance can be upper bound as
where is a constant that depends only on R. The second term in this bound converges to zero at a rate
. Combining the inequality
with the assumption that
is second order ergodic, one finds
that the first term also converges to zero. Putting everything together, we conclude that
To prove that this comparison holds without the square roots, it is sufficient to show that and
are bounded uniformly with respect to n. To this end, we can use the triangle inequality and the assumptions on
to write:
Combining this bound with the assumptions on and
establishes that
is bounded uniformly, and the same approach also works for
.
To prove the second part, we will show that for each , there exists a sequence of estimators
satisfying
and
. The existence of the limit in the definition of
means that for each
, there exists an integer N such that
for all
. By Lemma 21 in the Appendix, there exists a Lipschitz continuous function
such that
. For
, let
be defined by applying g to the first
successive length-N blocks of
and
setting any remaining entries to zero. Then, we have and
Putting everything together, we have for all n large enough. As
can be chosen arbitrarily small, the proof is complete.
The significance Theorem 12 is that it provides a link between the problem of estimation from compressed data, which is often difficult to study directly, and the better-understood problem of estimation in Gaussian noise. We emphasize that the assumptions on the source and data are quite general, particularly in comparison to many of the existing results in the literature.
Compared to optimal encoding schemes that attain the indirect distortion-rate function D(R), a useful property of random spherical coding is that it can be implemented without any knowledge of the underlying source distribution. Therefore, the coding scheme described in this paper can be employed in typical data acquisition situations where the distribution of the data and the source of interest is learned after the data are collected and quantized.
C. Universality of Linear Estimation
We now consider the performance of linear estimators. Given a bitrate R > 0, define the scalar
A standard calculation reveals that under the AWGN model, the MSE of the linear estimator is
independent of the problem dimension and is given by
where we recall that of (49) is the distortion-rate function associated with a zero-mean Gaussian source. In view of Theorem 12, this correspondence between the Gaussian distortion-rate function and the MSE of linear estimators in the AWGN model implies an achievable result for random spherical codinig.
Proposition 13. Let be a process satisfying Assumption 5. For each integer n, let
be the output
of a bitrate-R random spherical code with input and squared magnitude
. Then,
where is given by (58).
Applied to the special case of direct source coding , Proposition 13 recovers the results in [33] and [35], which showed that squared error distortion of a (properly scaled) random spherical code depends only on the second-order statistics of the source and is equal to the Gaussian distortion-rate function. The contribution of Proposition 13 is to show that this result carries over naturally to the indirect source coding setting. Moreover, if
are i.i.d. zero-mean Gaussian, then we have the equivalence:
We note that, in general, codebooks approaching the optimal trade-off between bitrate and MSE described by D(R) depend on the joint distribution of . This is because such codebooks essentially encode the sequence obtained by estimating
from
[21], [41], i.e., estimation precedes encoding in this case. When
and
are i.i.d. and jointly Gaussian, this estimation is obtained by multiplying
by
, and there is essentially no difference if this multiplication is performed pre- or post-encoding. To summarize, for i.i.d. Gaussian and zero mean
, the equality
is due to two factors: (1) The optimal estimator is a scalar multiple of the data, and (2) random spherical coding is optimal for encoding Gaussian sources.
D. Non-linear Estimation
Next, we consider the performance of non-linear estimators when the source and the data are non-Gaussian. Suppose that the source and the data are memoryless, that is the pairs are i.i.d. from a distribution
with finite second moments. Under this assumption, the indirect distortion-rate function D(R) can be expressed as
where the minimum is over all distributions on such that
, and
. Noting that
where can be approximated numerically using [61].
In the setting of the AWGN model, the memoryless assumption means that the problem of estimating from
decouples into n independent estimation problems, and the minimum MSE function is given by
where and
are independent. This expression can be approximated numerically using standard techniques.
An interesting special case of the indirect source coding problem occurs when the data is an AWGN corrupted
version of the source, that is
with independent of
. In this case, the Gaussian noise in the data can be combined with the
independent Gaussian noise in the AWGN model such that
where independent of
. In Figure 5, we provide a comparison of the indirect distortion-rate function D(R) and the upper bound on the distortion obtained using random spherical coding
in the setting where U is uniform on
and X is drawn according to (64). For comparison, we also plot the upper bound
corresponding to linear estimation, as well as the asymptotes of all MSE functions as the noise variance
vanishes.
The proof of Theorem 1 requires several lemmas.
Lemma 14. Suppose that V is distributed uniformly on the unit sphere in with
. For any
,
the distribution on V can be decomposed as
Fig. 5. Mean square error (MSE) in estimating an i.i.d. signal equiprobable on from a bitrate-R encoding of its AWGN-corrupted version. Left Panel: MSE versus noise variance
with a fixed encoding bitrate R = 1. Right Panel: MSE versus bitrate R with noise variance
is achievable using a random spherical code followed by a scalar Bayes estimator.
is achievable using random spherical coding followed by a scalar linear estimator. D(R) is the indirect distortion-rate function corresponding to the optimal encoding scheme. The dashed lines indicate the asymptotic MSE as
. Also shown is
, which is the minimal MSE in estimating the signal from its corrupted version corresponding to the limit
where is a random variable supported on
with complementary cumulative distribution function
of (6), and H is an independent random vector distributed uniformly on the set
Proof. By the orthogonal invariance of the distribution on V , we may assume without loss of generality that x is a unit vector of the form x = (1, 0, . . . 0). Then, and
. The joint distribution of (G, H) follows from the joint distribution on the entries of a random spherical vector [55, Eq. (3)].
Lemma 15. For , let Y be the output of an (n, M)-random spherical code with input
and
magnitude . The distribution of Y can be decomposed as
where has the cumulative distribution function (5), and H is an independent random vector distributed
uniformly on the set and
.
Proof. For each code word C(i) we apply the decomposition in Lemma 14 to obtain
where is the cosine similarity of the i-th codeword. Recall that the index
corresponds to the code word that maximizes the cosine similarity
. Therefore, the distribution of S follows from the fact that G(1), . . . G(M) are i.i.d. with complementary cumulative distribution function given by (6). Furthermore, because
depends only on the terms G(1), . . . G(M), it follows from Lemma 14 that
is independent of
and uniform on the subset of the unit sphere that is orthogonal to x. Noting that
completes the proof.
Lemma 16. Suppose that W is a standard Gaussian vector on with
. For any
the random
vector can be decomposed as
where (A, B, H) are independent, has a standard Gaussian distribution,
has a chi distribution with
degrees of freedom, and H is distributed uniformly on the set
and
.
Proof. By the orthogonal invariance of the Gaussian distribution on W, we may assume without loss of
generality that x is a unit vector of the form x = (1, 0, . . . , 0). Letting , and
yields
. By construction, A is a standard Gaussian variable that is independent of
. The distribution of (B, H) follows from the fact that
is the polar decomposition of the
-dimensional standard Gaussian vector
.
Using the characterizations of Y and Z given in Lemma 15 and Lemma 16, respectively, we see that for every
, there exists a coupling on (Y, Z) such that
where are independent. By the orthogonality of x and H, the magnitude of the difference between Y and Z depends only on the tuple
and is given by (8).
We considered the problem of estimating an underlying signal or parameter from the lossy compressed version of another high dimensional signal. For compression codes defined by a random spherical code of bitrate R, we showed that the distribution of the output codeword is close in Wasserstein distance to the conditional distribution of the output of an AWGN channel with SNR . This equivalence between the noise associated with lossy compression and an AWGN channel allows us to adapt existing techniques for inference from AWGN-corrupted measurements to estimate the underlying signal from the compressed measurements, as well as to characterize their asymptotic performance.
We demonstrated the usefulness of this equivalence by deriving novel expressions for the achievable risk in various source coding and parameter estimation settings. These include bitrate-constrained sparse parameter estimation using soft thresholding, bitrate-constrained parameter estimation in high-dimensional linear models, and indirect source coding with linear and non-linear decoders. In each of these settings, our results yielded achievable MSE and provided the equivalent noise level required to tune the estimator to attain this MSE.
We believe that the characterization of lossy compression error developed in this paper can be useful in numerous important cases aside from the ones we explored. Examples of such cases include hypothesis testing based on compressed data, signal estimation in distributed lossy compression settings, and the study of convergence rates and accuracies of first-order optimization procedures employing gradient compression.
The authors wish to thank Cynthia Rush, David Donoho, and Robert Gray for helpful discussions and comments. We are also grateful for the two anonymous reviewers that encouraged us to derive a much stronger version of the main results than we initially reported. The work of G. Reeves was supported in part by funding from the NSF under Grant No. 1750362. The work of A. Kipnis was supported in part by funding from the Koret Foundation and the NSF under Grant No. DMS-1816114.
The proof of Theorems 2 and 3 require a characterization of the moments of the random variable
where are deterministic parameters and
are independent random variables whose distributions are described in Theorem 1.
Given let
Evaluating with these values and then using the triangle inequality as well as basic trigonometric identities leads
to
where
The term is sub-Gaussian with mean and variance parameter independent of n. An estimate for its sub-Gaussian constant is provided in Lemma 20. For the term,
, we use the following result, which is proved in Appendix B.
Lemma 17. Suppose that for
. Then, for
,
where C is a numerical constant.
Proof of Theorem 2. In view of (71) and the fact that is sub-Gaussian with a constant that does not depend on on n all that remains is to establish the desired upper bound on the moments of
. Note that if
then this
term is bounded almost surely according to
where depends only on R. The remainder of the proof focuses on the case
.
Recall that the specification in (11) corresponds to the choice . Define
For it can be verified that there exists a unique value
such that
and
. Noting the the sin function is Lipschitz and non-decreasing on
we can write
The second term in (77) is deterministic and satisfies
To bound the moments of the first term in (77) we can use Lemma 17. Finally, recalling that , it
follows that
In view of (74) and n > p, it follows that
Finally, for the term , recalling that
, we see that for
,
Proof of Theorem 3. The specification given (13) corresponds to the choice
Under the assumption , there exists
such that
. Using the same
By Lemma 17 it follows that
where is a numerical constant. Since the secant function is non-decreasing on
we have
,
and so the first term is bounded from above by for some number
. The second term satisfies
Let integers n, M and be such that
. Recall that the goal is to bound the absolute moments of the random variable
where
is drawn according to (5). We consider two cases. First, on the event
, we can use the trivial upper bound of one. Note that
and so the probability of this event is
. Alternatively, on the event
, we use basic trigonometric identities to
write
Noting that leads to the upper bound
where
Combining these two cases yields
The remaining step in the proof is to provide an upper bound moments of . We need the following bounds on the function
defined in (6).
Lemma 18. For integer and
,
where
is the mean of the chi distribution with n degrees of freedom.
Proof. Making the change of variables and and using the relation
leads to
Noting that the denominator in the integral satisfies and then integrating gives the double inequality:
To prove an upper bound with s replaced by in the denominator, observe that for
,
Meanwhile, for n = 2, direct calculation reveals
Combining these upper bounds completes the proof.
Lemma 19. For integers and
,
Proof. Let . Expressing the event
in terms of
, we can write
Here, the third step follows because and the last step is due to the basic inequality
for
and
. Combining this inequality with the upper bound on
in (94) gives
(97). For the complementary event, we have
where we have used the basic inequality for all
. Combining with the lower bound on
in (94) gives (98).
To bound the distribution of we consider two cases. First, conditional on the event
, we can
write
where we used the basic inequality for x > 0. In view of (97) and the assumption M =
, it follows that for
,
where the last step uses the lower bound .
Next, conditional on the event , we can write
where we used the basic inequality for all
and
. In view of (98) and the assumption
, it then follows that for
,
where the last step uses the upper bound .
With (100) and (101) in hand, we can upper bound the moments of according to
where is the gamma function. Combining this inequality with (93) and using the basic upper bound
completes the proof of Lemma 17.
where .
Define . Minkowski’s inequality implies
Recall that of (95) is the mean of the chi-distribution with n degrees of freedom. By Jensen’s inequality,
can be upper bounded as
where we used that . To bound the deviation about the mean, we use the fact that B can be expressed as
where W is an
-dimensional vector with i.i.d. standard Gaussian entries. This allows us to write
, where
is given by
. The function
is 1-Lipschitz continuous because it is the composition of 1-Lipschitz functions. Therefore, we can apply the Tsirelson-Ibragimov-Sudakov Gaussian concentration inequality (see e.g., [56, Theorem 5.2.2]) to conclude that
is sub-Gaussian with variance parameter one. Consequently
Lemma 21. Suppose that (U, X) are random vectors in with finite second moments. Let
where
is known and
is independent Gaussian noise. Define
. For each
there exists a number L and estimator
with
such that
Proof. Given T > 0, let B be a binary random variable that is equal to zero if and one otherwise. Define g(z, b) = E[U | Z = z, B = b] to be the conditional expectation given (Z, B) and let f(z) = g(z, 0). Two
different applications of the law of total variance yields
Meanwhile, noting that f(Z) = g(Z, B) whenever B = 0, we have
where the second step follows from the triangle inequality and that fact that f(y) lies in a Euclidean ball of radius
T . Combining the above displays with the inequality leads to
By the assumption that has finite second moment, this upper bound converges to zero as T increases. Thus, for each
, there exists T large enough such that (103) holds.
Next, we will verify that f has a finite Lipschitz constant. Lemma 22 below implies that the Jacobian of f is
given by
By the Cauchy-Schwarz inequality and the definition of B, it follows that , uniformly for all z, and thus
Lemma 22. Let X be a n-dimensional random vector with , where
is the standard Gaussian
density in n dimensions, and let . Let
be a measurable function such that
is defined for any . The Jacobian of
is given by
Proof of Lemma 22: Set , and
. From Bayes rule
we have
It follows from [62, Thm. 2.7.1] that we may differentiate with respect to within the expectation. We get
The last transition implies
for any constant . It follows that
Proposition 23. Let be a random matrix with i.i.d. entries N(0, 1/n). Assume that
. Denote by
the result of t iterations of AMP using a sequence of local non-linearity functions
as in (41) and set
. If
is
-Lipschitz for k = 1, . . . , t, then with probability one there exists
such that
.
Proof. Use the tail bound on the maximal eigenvalue of a random matrix with sub-Gaussian entries from [56, Thm
4.4.5] to deduce that there exists a constant c, independent of n, such that
For , define the event
Using , the Borel-Cantelli Lemma applied to the sequence
implies that the event
occurs with probability one. Conditioning on G and given such , we consider the t-th iteration of AMP for reconstructing
from
as given by (39) and (40). For
applied element-wise to vectors
,
we have
and
For , denote by
and
the result of applying t iterations of (40) to x. We have
For t = 1, we have
and for the second inequality we take . Assume now that for all
, there are
and
From (104), (106), and (107), we obtain
From (105), (106), and (107), we obtain
It follows that both (106) and (107) hold with k = t.
We have shown that with probability one there exists such that, for each
, there exists
for which
It follows that for each .
[1] A. Kipnis and G. Reeves, “Gaussian approximation of quantization error for estimation from compressed data,” in 2019 IEEE International Symposium on Information Theory (ISIT), July 2019, pp. 2029–2033.
[2] R. Gray and D. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, Oct 1998.
[3] W. R. Bennett, “Spectra of quantized signals,” Bell System Technical Journal, vol. 27, no. 3, pp. 446–472, 1948.
[4] D. Marco and D. Neuhoff, “The validity of the additive noise model for uniform scalar quantizers,” IEEE Transactions on Information Theory, vol. 51, no. 5, pp. 1739–1755, May 2005.
[5] R. Zamir and M. Feder, “Rate-distortion performance in coding bandlimited sources by sampling and dithered quantization,” IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 141–154, Jan 1995.
[6] D. H. Lee and D. L. Neuhoff, “Asymptotic distribution of the errors in scalar and vector quantizers,” IEEE Transactions on Information Theory, vol. 42, no. 2, pp. 446–460, 1996.
[7] I. Kontoyiannis and R. Zamir, “Mismatched codebooks and the role of entropy coding in lossy data compression,” IEEE Transactions on Information Theory, vol. 52, no. 5, pp. 1922–1938, 2006.
[8] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,” Mathematics of Control, Signals, and Systems (MCSS), vol. 1, no. 2, pp. 167–182, 1988.
[9] Z. Zhang and T. Berger, “Estimation via compressed information,” IEEE Transactions on Information Theory, vol. 34, no. 2, pp. 198–211, 1988.
[10] T. Han and S. Amari, “Statistical inference under multiterminal data compression,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2300–2324, Oct 1998.
[11] J. Steinhardt and J. Duchi, “Minimax rates for memory-bounded sparse linear regression,” in Conference on Learning Theory, 2015, pp. 1564–1587.
[12] Y. Zhu and J. Lafferty, “Quantized minimax estimation over sobolev ellipsoids,” Information and Inference: A Journal of the IMA, vol. 7, no. 1, pp. 31–82, 2017.
[13] A. Kipnis and J. C. Duchi, “Mean estimation from one-bit measurements,” CoRR, vol. abs/1901.03403, 2019. [Online]. Available: http://arxiv.org/abs/1901.03403
[14] Y. Han, P. Mukherjee, A. Ozgur, and T. Weissman, “Distributed statistical estimation of high-dimensional and nonparametric distributions,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 506–510.
[15] Y. Dagan and O. Shamir, “Detecting correlations with little memory and communication,” in Conference On Learning Theory. PMLR, 2018, pp. 1145–1198.
[16] B. Szabo and H. van Zanten, “Adaptive distributed methods under communication constraints,” arXiv preprint arXiv:1804.00864, 2018.
[17] L. P. Barnes, Y. Han, and A. Ozgur, “Lower bounds for learning distributions under communication constraints via fisher information,” Journal of Machine Learning Research, vol. 21, no. 236, pp. 1–30, 2020.
[18] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[19] T. Berger, Rate-distortion theory: A mathematical basis for data compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[20] J. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum distortion,” IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 406–411, 1970.
[21] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Transactions on Information Theory, vol. 26, no. 5, pp. 518–521, 1980.
[22] A. Dembo and T. Weissman, “The minimax distortion redundancy in noisy source coding,” Information Theory, IEEE Transactions on, vol. 49, no. 11, pp. 3020–3030, 2003.
[23] T. Weissman, “Universally attainable error exponents for rate-distortion coding of noisy sources,” IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1229–1246, 2004.
[24] Y. Zhu and J. Lafferty, “Quantized estimation of gaussian sequence models in euclidean balls,” in Advances in Neural Information Processing Systems, 2014, pp. 3662–3670.
[25] D. L. Donoho and I. M. Johnstone, “Minimax risk over lp-balls for lp-error,” Probability Theory and Related Fields, vol. 99, no. 2, pp. 277–303, Jun 1994.
[26] I. Johnstone, “Gaussian estimation: sequence and multiresolution models,” Unpublished manuscript, 2011.
[27] Y. C. Eldar and G. Kutyniok, Compressed sensing: theory and applications. Cambridge University Press, 2012.
[28] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proceedings of the National Academy of Sciences, vol. 106, no. 45, pp. 18 914–18 919, 2009.
[29] V. K. Goyal, A. K. Fletcher, and S. Rangan, “Compressive sampling and lossy compression,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 48–56, 2008.
[30] R. G. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters, “Exponential decay of reconstruction error from binary measurements of sparse signals,” IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 3368–3385, 2017.
[31] A. Kipnis, G. Reeves, Y. C. Eldar, and A. J. Goldsmith, “Fundamental limits of compressed sensing under optimal quantization,” in Information Theory (ISIT), 2017 IEEE International Symposium on, 2017.
[32] M. Leinonen, M. Codreanu, M. Juntti, and G. Kramer, “Rate-distortion performance of lossy compressed sensing of sparse sources,” IEEE Transactions on Communications, vol. 66, no. 10, pp. 4498–4512, Oct 2018.
[33] D. Sakrison, “A geometric treatment of the source encoding of a Gaussian random variable,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 481–486, 1968.
[34] A. D. Wyner, “Communication of analog data from a Gaussian source over a noisy channel,” The Bell System Technical Journal, vol. 47, no. 5, pp. 801–812, 1968.
[35] A. Lapidoth, “On the role of mismatch in rate distortion theory,” IEEE Transactions on Information Theory, vol. 43, no. 1, pp. 38–47, 1997.
[36] L. Zhou, V. Y. F. Tan, and M. Motani, “Refined asymptotics for rate-distortion using Gaussian codebooks for arbitrary sources,” IEEE Transactions on Information Theory, vol. 65, no. 5, pp. 3145–3159, May 2019.
[37] P. Delsarte, J.-M. Goethals, and J. J. Seidel, “Spherical codes and designs,” in Geometry and Combinatorics. Elsevier, 1991, pp. 68–93.
[38] T. M. Cover and J. A. Thomas, Elements of information theory (2. ed.). Wiley, 2006.
[39] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Transactions on Information Theory, vol. 33, no. 6, pp. 759–772, 1987.
[40] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower bounds for distributed statistical estimation with communication constraints,” in Advances in Neural Information Processing Systems, 2013, pp. 2328–2336.
[41] A. Kipnis, S. Rini, and A. J. Goldsmith, “The rate-distortion risk in estimation from compressed data,” IEEE Transactions on Information Theory, vol. 67, no. 5, pp. 2910–2924, 2021.
[42] A. Kipnis, A. J. Goldsmith, and Y. C. Eldar, “The distortion-rate function of sampled Wiener processes,” IEEE Transactions on Information Theory, vol. 65, no. 1, pp. 482–499, Jan 2019.
[43] S. T. Rachev and L. Rüschendorf, Mass Transportation Problems: Volume I: Theory. Springer Science & Business Media, 1998, vol. 1.
[44] L. Ambrosio, Y. Brenier, G. Buttazzo, and C. Villani, Optimal Transportation and Applications: Lectures given at the CIME Summer School held in Martina Franca, Italy, September 2–8, 2001. Springer, 2003.
[45] C. Villani, Optimal transport: old and new. Springer Science & Business Media, 2008, vol. 338.
[46] D. Pollard, “Quantization and the method of k-means,” IEEE Transactions on Information theory, vol. 28, no. 2, pp. 199–205, 1982.
[47] T. Linder, “Lagrangian empirical design of variable-rate vector quantizers: consistency and convergence rates,” IEEE Transactions on Information Theory, vol. 48, no. 11, pp. 2998–3003, 2002.
[48] R. Gray and D. Ornstein, “Block coding for discrete stationary d-continuous noisy channels,” IEEE Transactions on Information Theory, vol. 25, no. 3, pp. 292–306, May 1979.
[49] R. Gray, D. Ornstein, and R. Dobrushin, “Block synchronization, sliding-block coding, invulnerable sources and zero error codes for discrete noisy channels,” The Annals of Probability, pp. 639–674, 1980.
[50] R. Zamir and M. Feder, “On lattice quantization noise,” IEEE Transactions on Information Theory, vol. 42, no. 4, pp. 1152–1159, Jul 1996.
[51] M. Raginsky, I. Sason et al., “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends® in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013.
[52] P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, “The communication complexity of correlation,” in Twenty-Second Annual IEEE Conference on Computational Complexity (CCC’07). IEEE, 2007, pp. 10–23.
[53] P. Cuff, “Distributed channel synthesis,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7071–7096, 2013.
[54] C. T. Li and A. El Gamal, “A universal coding scheme for remote generation of continuous random variables,” IEEE Transactions on Information Theory, vol. 64, no. 4, pp. 2583–2592, April 2018.
[55] A. J. Stam, “Limit theorems for uniform distributions on spheres in high-dimensional euclidean spaces,” Journal of Applied Probability, vol. 19, no. 1, pp. 221–228, Mar. 1982.
[56] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018, vol. 47.
[57] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 2, pp. 764–785, Feb 2011.
[58] R. M. Gray, Entropy and information theory. Springer-Verlag, 2011, vol. 1.
[59] V. Kostina and S. Verdú, “Nonasymptotic noisy lossy source coding,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6111–6123, Nov 2016.
[60] A. Kipnis, “Fundamental distortion limits of analog-to-digital compression,” Ph.D. dissertation, Stanford University, 2017.
[61] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, Jul 1972.
[62] E. L. Lehmann and J. P. Romano, Testing statistical hypotheses. Springer Science & Business Media, 2006.