b

DiscoverSearch
About
My stuff
Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions
2020·arXiv
Abstract
Abstract

The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression, for which we derive optimal near-linear-time block and sequential algorithms. To further our understanding of pro-file entropy, we investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.

Recent research in statistical machine learning, ranging from neural-network training and online learning, to density estimation and property testing, has advanced evaluation criteria beyond worst-case analysis. New performance measures apply more refined metrics relating the algorithm’s accuracy and efficiency to the problem’s inherent structure.

Consider for example learning an unknown discrete distribution from its i.i.d. samples. Classical worst-case analysis states that in the worst case, the number of samples required to estimate a distribution to a given KL-divergence grows linearly in the alphabet size.

However, this formulation is pessimistic. Distributions are rarely the worst possible, and many practical distributions can be estimated with significantly smaller samples. Furthermore, once the sample is drawn, it reveals the distribution’s complexity and hence the hardness of the learning task.

Going beyond worst-case analysis, we design an adaptive learning algorithm whose theoretical guarantees vary according to the problem’s simplicity. For example, Orlitsky and Suresh [2015] recently proposed an estimator that instance-by-instance achieves nearly the same performance as a genie algorithm designed with prior knowledge of the underlying distribution.

We introduce profile entropy, a fundamental measure for the complexity of discrete distributions, and show that it connects three vital scientific tasks: estimation, inference, and compression. The resulting algorithms have guarantees directly relating to the data profile entropy, hence also adapt to the intrinsic simplicity of the tasks at hand.

The next subsections, formalize the relevant concepts and present relevant prior works.

1.1 Sample Profiles and Their Entropy

Consider an arbitrary sequence  xnover a finite or countably infinite alphabet X. The multiplicity µy(xn)of a symbol  y ∈ Xis the number of times y appears in  xn. The prevalence of an integer  µis the number  ϕµ(xn)of symbols in  xnwith multiplicity  µ. The profile of  xnis the multiset  ϕ(xn)of multiplicities of the symbols in  xn. We refer to it as a profile of length n.

The number D(S) of distinct elements in a multiset S is its dimension. For convenience, we also write  D(xn)for profile dimension. Note that the dimension of a length-n profile is at most min{√2n, |X|}.

Let  ∆Xbe the collection of distributions over X, and p be an arbitrary distribution in  ∆X. The profile  Φnof an i.i.d. sample  Xn ∼ pis a random variable whose distribution depends on only p and n. We therefore write  Φn ∼ p, and call  H(Φn)the profile entropy with respect to (p, n). Analogously, we call  Dn := D(Φn), the profile dimension associated with (p, n) and write  Dn ∼ p. Due to the dependence among multiplicities, the distributions of  Φnand  Dnare rather complex in general. To obtain clean expressions, we can adopt the standard Poisson sampling technique and make the sample size a Poisson variable  N ∼ Poi(n), independent of the sample. As an example,

image

where  pxdenotes the probability of symbol x assigned by p. Note that sometimes we also write p(x) instead of  pxfor notational convenience. Despite the complex landscape of statistical dependency, in Theorem 1, we show that  Dn ∼ pand  H(Φn ∼ p)are of the same order, with high probability and for every  p ∈ ∆X. In Theorem 6 and 7, we show that  Dn ∼ phighly concentrates around a variant of its expectation. In Section 2.5, we provide a much simpler quantity  HSn (p)that well approximates the expectation variant of  Dn ∼ p. Combined, these three results provide a precise characterization of both profile entropy and dimension. Leveraging this in Section B.2, we derive nearly-tight bounds on the magnitude of profile entropy for several important structural distribution families, including log-concave and power-law.

1.2 Applications and Prior Works

In this section, we present several learning and compression applications in which the profile entropy would play an important role. We also review related prior works with an emphasize on adaptive algorithms, and suppress the discussions on the worst-case analysis for brevity.

Basics and Significance

The profile of a sample corresponds to the empirical distribution of symbols, and reflects the magnitudes of the actual symbol probabilities. Hence, the profile dimension, the number of distinct symbol frequencies, characterizes the variability of ranges the probabilities spread over. The sample profile’s entropy, which by Theorem 1 is of the same order as its dimension, admits the same interpretation.

Intuitively, samples from simple distributions tend to have low profile entropy, such as those from a m-piecewise distribution with small m, or one whose probability masses concentrate over some sparse set. The profile entropy is also likely to decrease as one reduces the sample size, since the sample contains less information regarding the variability of distribution probabilities. See Theorem 9, 14, and 19 for a formal justification of these arguments.

From a statistical perspective, the profile of a sample is a sufficient statistic for estimating the probability multiset and any symmetric functional of the underlying distribution, such as entropy and support size. When we express a profile as a collection of multiplicity-prevalence pairs, the profile dimension is the size of this collection. Being of the same magnitude, the profile entropy is thus the effective size of a natural sufficient statistic for label-invariant inference.

Profile entropy also directly connects to adaptive testing and classification. Such connection arises from computing the profile probability Acharya et al. [2011, 2012], the probability of observing a sample with the given profile. If the profile has entropy of H, we can show that this computation problem has a time complexity of  O(exp(˜Θ(H) log |X|)). The result follows by the equivalence of the problem and computing the permanent of a rank-˜Θ(H)matrix Barvinok [1996], Vontobel [2012, 2014], Barvinok [2016].

Below, we introduce two important applications that are more involved, in which the profile entropy has essential connection to the statistical efficacy of adaptive learning.

Distribution Estimation

Estimating unknown distributions from their samples is a statistical-inference cornerstone, and has numerous applications, ranging from biological studies Armañanzas et al. [2008] to language modeling Chen and Goodman [1999].

A learning algorithm in this setting is often referred to as a distribution estimator, which is a functional  ˆpassociating with every sequence  xnover X a distribution  ˆpxn ∈ ∆X. Given a sample Xn ∼ p, we measure the performance of  ˆpin estimating the (unknown) distribution p with a loss function  ℓ(p, ˆpXn), e.g., the  ℓ1distance and KL divergence.

A classical worst-case type result shows that any estimator that achieves a small  ℓ1loss of  ε > 0over  ∆Xin expectation requires a sample size of  Ω(|X|/ ε2). Recent research further shows that the naive empirical-distribution estimator attains the optimal sample efficiency, to the right constants.

The desire to design more efficient estimators for practical distributions such as Poisson mixtures leads to two adaptive estimation frameworks: structural and competitive.

Structural estimation focuses on distributions possessing a natural structure, such as monotonicity, m-modality, and log-concavity. In many cases including the mentioned, structural assumptions lead to effective estimators that provably perform better on the corresponding distribution classes. See Bühlmann et al. [2016] for a review of recent literature.

Competitive estimation aims to design estimators that are universally near-optimal. Without strong structural knowledge, a reasonable estimator should naturally assign the same probability to symbols appearing equal number of times. The objective here is to find an estimator that learns every distribution as well as the best natural estimator designed with knowledge of the true distribution. Discussion continues in Section 2.3 with a review of relevant works.

Property (Functional) Inference

Instead of recovering the underlying distribution, numerous practical applications require only inferring a particular property value, such as entropy for graphical modeling Koller and Friedman [2009], and support size for species richness estimation Magurran [2013].

Formally, a distribution property over a distribution collection  P ⊆ ∆Xis a functional  f : P → Rthat associates with each distribution in P a real value. Given a sample  Xnfrom an unknown distribution  p ∈ P, the problem of interest is to infer the value of f(p). To do this, we employee another functional ˆf : X ∗ → R, a property estimator that maps every sample to a real value.

The statistical efficiency of ˆfin estimating f with respect to the distribution collection P is measured by its sample complexity. Specifically, for an accuracy  ε > 0and error tolerance  δ ∈ (0, 1), the  (ε, δ)-sample complexity of ˆfwith respect to (f, P) is the minimal sample size n for which PrXn∼p(| ˆf(Xn) − f(p)| > ε) ≤ δfor all  p ∈ P. Note that for the special case of P = {p}, the sample complexity directly characterizes the ability of ˆfin estimating f(p).

Recent years have shown interests in determining the sample complexities of inferring distribution properties. Built upon worst-case analysis, the major contribution of these works is establishing the sufficiency of sample sizes sub-linear in |X|. As an example, in the vital sample-sparse regime and over  ∆X, the  (ε, 1/10)-sample complexity of learning entropy is  Θ(|X|/(ε log |X|)). We refer the readers to Verdú [2019] for a thorough survey of related works.

As the problem involves two components, the property and distribution, adaptive analysis also advances in two veins.

The first vein concerns constructing a universal plug-in estimator for all symmetric properties. A symmetric property is invariant under symbol permutations, hence it suffices to obtain an accurate estimate of the probability multiset. Recently, following the works of Das [2012], Acharya et al. [2017], Hao and Orlitsky [2019a] show that for any symmetric property that is additively separable and appropriately Lipschitz, the profile maximum likelihood (Section 2.2) achieves the optimal sample complexity up to small constant factors. Other major works include Valiant and Valiant [2011, 2013, 2016], Han et al. [2018], Charikar et al. [2019b]

The second vein is an analogy to the competitive distribution estimation framework, and aims to compete with the instance-by-instance performance of a genie having access to more information, but reasonably restricted. A natural choice for the genie is the best-known and most-used – the empirical estimator that evaluates the property at the sample empirical distribution. To empower the genie, we grant it access to a sample whose size is logarithmically larger than that available to the learner. One can show that this enables the genie to universally achieve the optimal sample complexities for numerous properties and hypothesis classes P. Under this formulation, Hao et al. [2018], Hao and Orlitsky [2019b] provide a unified learning algorithm that achieves the optimal competitiveness guarantees in near-linear time.

In this work, we further both veins of works and show that: 1) the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes; 2) when plugged into entropy, the estimator in Hao and Orlitsky [2019c] approximates the property as well as the plug-in estimator whose distribution component is the best natural, for every distribution. See Theorem 3 and 4 for the formal statements.

We establish essential connections between profile entropy and the estimation of distributions, inference of their properties, and compression of profiles. To further our understanding of profile entropy, we then investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.

For space considerations, we relegate most technical proofs to the appendices.

Permutation invariance By definition, both the profile of a sequence and its dimension are invariant to domain-symbol permutations. Since entropy is a symmetric property, the profile entropy of an i.i.d. sample is also permutation invariant. Consequently, a result in this section that holds for a distribution will also hold for any distributions sharing the same probability multiset.

This is desirable for practical applications, since samples often come as categorical data, while the symbol ordering under which the underlying distribution would exhibit certain structure is unknown to the learner. For example, in natural language processing, we observe words and punctuation marks. Given that the data comes from a power-law distribution Mitzenmacher [2004], we often don’t know how to order the alphabet to realize such a condition.

Surprisingly, with a few exceptions such as Hao and Orlitsky [2019c], most previous works on learning structured discrete distributions do not address this crucial matter in their learning algorithms. The existing results are rather artificial and more like learning discretized continuous distributions. See Section B.2 for our discussion on distribution discretization.

2.1 Profile Dimension and Entropy

image

where the notation ˜Θ(·)hides logarithmic factors of n.

The theorem shows that for every distribution and sampling parameter n, the induced profile entropy and profile dimension are of the same order, with high probability.

Taking expectation and noting that  D(Φn)∈[1,√2n]yield

image

2.2 Adaptive Property Estimation

Definitions A profile  φis said to have length n if there exists  xn ∈ X nsatisfying  φ = ϕ(xn). For every profile  φof length n and distribution collection  P ⊆ ∆X, the profile maximum likelihood (PML) estimator Orlitsky et al. [2004] over P maps  φto a distribution

image

that maximizes the probability of observing the profile  φ. For any property f, let  εf(n, δ, P)denote the smallest error that can be achieved by any estimator with a sample size n and tolerance  δon the error probability. This definition is equivalent to that of the sample complexity. Below, we assume that P is label invariant, i.e., for any  p ∈ P, collection P contains all its symbol-permuted versions.

We first show that profile-based estimators are sufficient for estimating symmetric properties. Theorem 2 (Sufficiency of profiles). Let f be a symmetric property over P. For any accuracy  ε > 0and tolerance  δ ∈ (0, 1), if there exists an estimator ˆfsuch that

image

there is an estimator ˆfϕover length-n profiles satisfying

image

Note that both estimators can have independent randomness.

The second result shows that the PML estimator is adaptive to the simplicity of underlying distributions in inferring all symmetric properties, over any label-invariant P. For clarity, we set  δ = 1/10and suppress both  δand P in  εf(n, δ, P).

Theorem 3 (Adaptiveness of PML). Let f be a symmetric property. For any  p ∈ Pand  Φn ∼ p, with probability at least  1 − O(1/√n),

image

Some comments: 1) The theorem holds for any symmetric properties, while nearly all previous works require the property to possess certain forms and be smooth; 2) The theorem trivially implies a weaker result in Acharya et al. [2017] where  ⌈H(Φn)⌉is replaced by √n; 3) There is a polynomial-time approximation Charikar et al. [2019a] achieving the same guarantee; 4) We provide a stronger result in Section A.5 of the appendices for general  δ.

Besides this theorem, we establish in Section A.6 and A.7 two additional results on PML. The first result addresses sorted distribution estimation, and improves over that established in Hao and Orlitsky [2019a] (Theorem 5) in terms of the lower bound on the accuracy parameter  ε. Let  ΣXdenote the collection of symbol permutations over X. For any  p ∈ ∆X, denote by  pσ ∈ ∆Xthe permuted distribution satisfying  pσ(x) = p(σ(x))for all symbols  x ∈ X. Let  λ > 0be a positive absolute constant that can be made arbitrarily small, e.g.,  λ = 0.001.

Lemma 1. For any  ε ∈ (0, 1), p ∈ P = ∆X, and  Φn ∼ p, if we have  n ≥ Ω(|X|/(ε2 log |X|))and ε ≥ 1/n1/8−λ, with probability at least  1 − O(exp(−√n)),

image

A few comments in order: 1) The polynomial-time computable variant of PML in Charikar et al. [2019a] satisfies the same guarantee, and the proof for this is also similar to that in Section A.6; 2) Using the existing efficiently computable PML-type methods Charikar et al. [2019a,b], the best possible lower bound on  εis  Θ(1/n1/4); 3) Below the  Θ(1/n1/3)threshold, the empirical distribution estimator is sample optimal up to constant factors Han et al. [2018].

The second result shows an intriguing connection between the PML method and the task of uniformity testing Goldreich and Ron [2011]. See Section A.7 for details.

Our last result in this section addresses entropy estimation. We show that when plugged into entropy, the estimator in Hao and Orlitsky [2019c] approximates the property as well as the plug-in estimator whose distribution component is the best natural, for every distribution.

Recall that a distribution estimator is natural if it assigns the same probability to symbols of equal multiplicity, and a property estimator is plug-in if it first finds an estimate of the distribution and then evaluates the property at this estimate. As an off-the-shelf method, the plug-in approach is widely used in estimating distribution properties.

If further the property is symmetric, then it suffices to obtain an accurate estimate of the probability multiset, which is intuitively more statistically efficient than recovering the actual distribution. For example, Hao and Orlitsky [2019a] recently show that for any symmetric property that is additively separable and appropriately Lipschitz, the PML multiset estimator Orlitsky et al. [2004] achieves the optimal sample complexity up to small constant factors.

However, the analysis and computation (though efficient) of such multiset-based estimation methods are often involved Valiant and Valiant [2011, 2013, 2016], Han et al. [2018], Charikar et al. [2019b], Hao and Orlitsky [2019a]. For this reason, distribution-based plug-in estimators are still popular in practice, and often, the distribution components are natural.

As an example, for entropy estimation, several widely used distribution-based estimators are natural plug-in, such as the empirical estimator plugging in the empirical distribution, James-Stein shrinkage Hausser and Strimmer [2009] that shrinks the distribution estimate towards uniform, and Dirichlet-smoothed Schürmann and Grassberger [1996] that imposes a Dirichlet prior over  ∆X.

The logic behind these estimators is simple: if two distributions are close, then the same is expected to hold for their entropy values. The next theorm shows that for every distribution and among all plug-in entropy estimators, the distribution estimator in Hao and Orlitsky [2019c] is as good as the one that performs best in estimating the actual distribution.

Denote by N the collection of all natural estimators. Write  |H(p) − H(q)|as  ℓH(p, q)for compactness and the KL-divergence between  p, q ∈ ∆Xas  ℓKL(p, q).

Theorem 4 (Competitive entropy estimation). For any distribution p, sample  Xn ∼ pwith profile Φn := ϕ(Xn), and  ˆpNXn := arg minˆp∈N ℓKL(p, ˆpXn ), we have

image

with probability at least  1 − O(1/n).

2.3 Competitive Distribution Estimation

Prior works Competitive estimation calls for an estimator that competes with the instance-by-instance performance of a genie knowing more information, but reasonably restricted. Denote by ℓKL(p, q)the KL divergence. Introduced in Orlitsky and Suresh [2015], the formulation considers the collection N of all natural estimators, and shows that a simple variant  ˆpGTof the Good-Turing estimator achieves

image

for every distribution p and with high probability. We refer to the left-hand side as the excess loss of estimator  ˆpGTwith respect to the best natural estimator, and note that it vanishes at a rate independent of p. For a more involved estimator in Acharya et al. [2013], the excess loss vanishes at a faster rate of ˜O(min{1/√n, |X|/n}), optimal up to logarithmic factors for every estimator and the respective worst-case distribution. For the  ℓ1distance, Valiant and Valiant [2016] derive a similar result.

These estimators track the loss of the best natural estimator for each distribution. Yet an equally important component, the excess loss bound, is still of the worst-case nature. For a fully adaptive guarantee, Hao and Orlitsky [2019c] design an estimator  ˆp⋆that achieves a  Dn/nexcess loss, i.e.,

image

for every p and  Xn ∼ p, with high probability. Utilizing the adaptiveness of  Dnto the simplicity of distributions, the paper derives excess-loss bounds for several important distribution families, and proves the estimator’s optimality under various of classical and modern learning frameworks.

New results While the work of Hao and Orlitsky [2019c] provides an appealing upper bound on the excess loss, it is not exactly clear how good this bound is as a matching lower bound is missing. In this work, we complete the picture by showing that the  Dn/nbound is essential for competitive estimation and optimal up to logarithmic factors of n.

Theorem 5 (Minimal excess loss). For any  n, D ∈ Nand distribution estimator  ˆp′, there is a distribution p such that with probability at least 9/10, we have both

image

By Theorem 1, we can replace  Dnby ˜Θ(H(Φn))in both the upper and lower bounds.

2.4 Optimal Profile Compression

While a labeled sample contains all information, for many modern applications, such as property estimation and differential privacy, it is sufficient Orlitsky et al. [2004] or even necessary to provide only the profile Suresh [2019]. Hence, this section focuses on the lossless compression of profiles.

For any distribution p, it is well-known that the minimal expected codeword length (MECL) for losslessly compressing a sample  Xn ∼ pis approximately nH(p), which increases linearly in n as long as H(p) is bounded away from zero.

On the other hand, by the Hardy-Ramanujan formula Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies

image

Consequently, the MECL for losslessly compressing the sample profile  Φn ∼ pis at most  O(√n), a number potentially much smaller than nH(p).

By Shannon’s source coding theorem, the profile entropy  H(Φn)is the information-theoretic limit of MECL for the lossless compression of profile  Φn ∼ p. Below, we present explicit block and sequential profile compression schemes achieving this entropy limit, up to logarithmic factors of n.

Block compression The block compression algorithm is intuitive and easy to implement.

Recall that the profile of a sequence  xnis the multiset  ϕ(xn)of multiplicities associated with symbols in  xn. The ordering of elements in a multiset is not informative. Hence equivalently, we can compress  ϕ(xn)into the set  C(ϕ(xn))of corresponding multiplicity-prevalence pairs, i.e.,

image

The number of pairs in  C(ϕ(xn))is equal to the profile dimension  D(ϕ(xn)). In addition, both a prevalence and its multiplicity are integers in [0, n], and storing the pair takes 2 log n nats. Hence, it takes at most  2(log n) · D(ϕ(xn))nats to store the compressed profile. By Theorem 1, for any distribution  p ∈ ∆Xand  Φn ∼ p,

image

Sequential compression For any sequence  xn, the setting for sequential profile compression is that at time step  t ∈ [n], the compression algorithm knows only  ϕ(xt)and sequentially encodes the new information. This is equivalent to providing the algorithm  µxt(xt−1)at time step t.

Suppress  x, xtin the expressions for the ease of illustration. For efficient compression, we sequentially encode the profile  ϕinto a self-balancing binary search tree T , with each node storing a multiplicity-prevalence pair  (µ, ϕµ)and  µbeing the search key. We present the algorithm details as follows.

image

The algorithm runs for exactly n iterations, with a O(log n) per-iteration time complexity. For an i.i.d. sample  Xn ∼ p, the expected space complexity is again ˜Θ(⌈H(Φn)⌉).

2.5 Attributes of Profile Entropy and Dimension

To further our understanding of profile entropy and dimension, we investigate the analytical and statistical attributes of these characteristics concerning their concentration, computation and approximation, monotonicity, and Lipschitzness.

Concentration

Recall that the multiplicity  µy(xn)denotes the number of times symbol y appearing in  xn. Denote by �the logical OR operator. For any distribution p and  Xn ∼ p, we have

image

The statistical dependency landscape of terms in the summation is rather complex, since  µx(Xn)and  µy(Xn)are dependent for every (x, y) pair due to the fixed sample size; and so are  1µx(Xn)=µ1and  1µx(Xn)=µ2for every pair of distinct  µ1and  µ2. To simplify the derivations, we relate this quantity to its variant under the aforementioned Poisson sampling scheme, i.e., making the sample size an independent  N ∼ Poi(n). Specifically, define

image

Note that this is not the same as  DNsince the summation index goes up only to n. Denote the expected value of ˜DNby  En(p). Our result shows that the original  Dnsatisfies a Chernoff-Hoeffding type bound centered at  En(p).

Theorem 6. Under the above conditions and for any  n ∈ Z+, p ∈ ∆X, and  γ > 0,

image

and for any  γ ∈ (0, 1),

image

As a corollary, the value of  Dnis often close to  En(p). Corollary 1. Under the same conditions as above and for any  n ∈ Z+and distribution  p ∈ ∆X, with probability at least  1 − 6/√n,

image

In addition, we establish an Efron-Stein type inequality.

Theorem 7. For any distribution p and  Dn ∼ p,

image

Computation and Approximation

The above results show that  Dn ∼ pis often close to  En(p), with an exponentially small deviation probability. Hence, to approximate  Dn, it suffices to accurately compute  En(p), the expectation of its Poissonized version ˜DN. By independence and the linearity of expectations,

image

The expression is exact but does not relate to p in a simple manner. For an intuitive approximation, we partition the unit interval into a sequence of ranges,

image

denote by  pIjthe number of probabilities in  Ij, and relate  En(p)to a shape-reflecting quantity

image

the sum of the effective number of probabilities lying within each range Hao and Orlitsky [2019c]. To compute  HSn (p), we simply count the number of probabilities in each  Ij. Our main result shows that  HSn (p)well approximates  En(p)over the entire  ∆X, up to logarithmic factors of n.

Theorem 8. For any  n ∈ Z+and  p ∈ ∆X,

image

Summary The simple expression shows that  HSn (p)characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,  HSn (p)closely approximates  En(p), the value around which  Dn ∼ pconcentrates (Theorem 6). Henceforth, we use  HSn (p)as a proxy for both  H(Φn)and  Dn, and study its attributes and values.

Monotonicity

Among the many attributes that  HSn (p)possesses, monotonicity is perhaps most intuitive. One may expect a larger value of  HSn (p)as the sample size n increases, since additional observations reveal more information on the variability of probabilities. Below we confirm this intuition.

Theorem 9. For any  n ≥ m ≫ 1and  p ∈ ∆X,

image

Besides the above result that lowerly bounds  HSn (p)with  HSm(p)for  m ≤ n, a more desirable result is to upperly bound  HSn (p)with a function of  HSm(p).

Such a result will enable us to draw a sample of size  m ≤ n, obtain an estimate of  HSm(p)from  Dm, and use it to bound the value of  HSn (p)and thus of  Dnfor a much larger sample size n. With such an estimate, we can perform numerous tasks such as predicting the performance of algorithms in Section 2.2 and 2.3 when more observations are available, and the space needed for storing a longer sample profile. The next theorem provides a simple and tight bound on  HSn (p)in terms of  HSm(p).

Theorem 10. For any  n ≥ m ≫ 1and  p ∈ ∆X,

image

The aforementioned application of this result is closely related to the recent works on learnability estimation Kong and Valiant [2018], Kong et al. [2019].

Lipschitzness

Viewing  HSn (p)as a distribution property, we establish its Lipschitzness with respect to a weighted Hamming distance and the  ℓ1distance. Given two distributions  p, q ∈ ∆X, the vanilla Hamming distance is denoted by

image

The distance is suitable for being a statistical distance since there may be many symbols at which the two distributions differ, yet those symbols account for only a negligible total probability and has little effects on many induced statistics. To address this, we propose a weighted Hamming distance

image

The next result measures the Lipschitzness of  HSnunder  hW.

Theorem 11. For any integer n, and distributions p and q, if  hW(p, q) ≤ εfor some  ε ≥ 1/n, ��HSn (p) − HSn (q)�� ≤ O(√εn).

Replacing  max{px, qx}with  |px − qx|results in a common similarity measure – the  ℓ1distance. The next theorem is an analog to the above under this classical distance.

Theorem 12. For any integer n, and distributions p and q, if  ℓ1(p, q) ≤ εfor some  ε ≥ 0, ��HSn (p) − cHSn (q)�� ≤ O((εn)2/3),

where c is a constant in [1/3, 3]. Note that the inequality is significant iff  ε ≤ ˜Θ(1/n1/4), since the value of  HSn (p)is at most  O(√n log n)for all p.

2.6 Profile Entropy for Structured Families

Following the study of attributes of profile entropy, we derive nearly tight bounds for the  HSn (p)values of three important structured families, log-concave, power-law, and histogram. These bounds tighten up and significantly improve those in Hao and Orlitsky [2019c], and show the ability of profile entropy in charactering natural shape constraints.

Below, we follow the convention of specifying the structured distributions over X = Z.

Log-Concave Distributions

We say that  p ∈ ∆Zis log-concave if p has a contiguous support and  p2x ≥ px−1px+1for all  x ∈ Z.

The log-concave family encompasses a broad range of discrete distributions, such as Poisson, hyperPoisson, Poisson binomial, binomial, negative binomial, geometric, and hyper-geometric, with wide applications to numerous research areas, including statistics Saumard and Wellner [2014], computer science Lovász and Vempala [2007], economics An [1997], algebra, and geometry Stanley [1989].

The next result upperly bounds the profile entropy of log-concave families, and is tight up to logarithmic factors of n.

Theorem 13. For any  n ∈ Z+and distribution  p ∈ ∆Z, if p is log-concave and has a variance of  σ2,

image

This upper bound is uniformly better than the  min{σ, (n2/σ)1/3}bound in Hao and Orlitsky [2019c].

A similar bound holds for t-mixtures of log-concave distributions. More concretely,

Theorem 14. For any integer n and distribution  p ∈ ∆Z, if p is a t-mixture of log-concave distributions each has a variance of  σ2i, where i = 1, . . . , t,

image

The introduction about log-concave families covers numerous classical discrete distributions, yet leaves many more continuous ones untouched Bagnoli and Bergstrom [2005]. Below, we present a discretization procedure that preserves distribution shapes such as monotonicity, modality, and log-concavity. Applying this procedure to the Gaussian distribution  N(µ, σ2)further shows the optimality of Theorem 13.

Discretization

Let X be a continuous random variable over R with density function f(x). For any  x ∈ R, denote by  ⌈x⌋the closest integer z such that  x ∈ (z − 1/2, z + 1/2]. The  ⌈X⌋has a distribution over Z:

image

We refer to  ⌈X⌋as the discretized version of X.

Shape preservation By definition, one can readily verify that the above transformation preserves several important shape characteristics of distributions, such as monotonicity, modality, and k-modality (possibly yields a smaller k). The following theorem covers log-concavity.

Theorem 15. For any continuous random variable X over R with a log-concave density f, the distribution  p ∈ ∆Zassociated with  ⌈X⌋is also log-concave.

Moment preservation Denote by p the distribution of  ⌈X⌋for  X ∼ f. Let  µand  σ2be the mean and variance of density f, given that they exist. The theorem below shows that the discrete distribution p has, within small additive absolute constants, a mean of  µand variance of  Θ(σ2).

Theorem 16. Under the aforementioned conditions, the mean of  ⌈X⌋satisfies

image

and the variance of  ⌈X⌋satisfies

image

Optimality of Theorem 13

By the above formula, the discretized Gaussian  ⌈N⌋(µ, σ2)has a distribution in the form of

image

Consolidating Theorem 15 and 16 shows that  pGis a log-concave distribution with a variance of Θ(σ2) ± 1. Consequently, Theorem 13 yields the following upper bound:

image

The optimality of Theorem 13 follows by these inequalities.

Power-Law Distributions

We say that a discrete distribution  p ∈ ∆Zis a power-law with power  α > 0if p has a support of [k] := {1, . . ., k} for some  k ∈ Z+ ∪ {∞}and  px ∝ x−αfor all  x ∈ [k].

Power-law is a ubiquitous structure appearing in many situations of scientific interest, ranging from natural phenomena such as the initial mass function of stars Kroupa [2001], species and genera Humphries et al. [2010], rainfall Machado and Rossow [1993], population dynamics Taylor [1961], and brain surface electric potential Miller et al. [2009], to man-made circumstances such as the word frequencies in a text Baayen [2002], income rankings Dr˘agulescu and Yakovenko [2001], company sizes Axtell [2001], and internet topology Faloutsos et al. [1999].

Unlike log-concave distributions that concentrate around their mean values, power-laws are known to possess “long-tails” and always log-convex. Hence, one may expect the profile entropy of power-law distributions to behave differently from that of log-concave ones. The next theorem justifies this intuition and provides tight upper bounds.

Theorem 18. For a power-law distribution  p ∈ ∆[k]with power  α, we have

image

where

image

The above upper bound fully characterizes the profile entropy of power-laws and surpasses the basic {k, √n log n}bound for both  k ≫ √nand  k ≪ √n. In comparison, Hao and Orlitsky [2019c] yields a  O(nmin{1/(1+α),1/2})upper bound, which improves over √n log nfor only  α > 1and is worse than that above for all  α < 1+1/log k.

Histogram Distributions

A distribution  p ∈ ∆Xis a t-histogram distribution if there is a partition of X into t parts such that p has the same probability value over all symbols in each part.

Besides the long line of research on histograms reviewed in Ioannidis [2003], the importance of histogram distributions rises with the rapid growth of data sizes in numerous engineering and science applications in the modern era.

For example, in scenarios where processing the complete data set is inefficient or even impossible, a standard solution is to partition/cluster the data into groups according to the task specifications and element similarities, and randomly sample from each group to obtain a subset of the data to use. This naturally induces a histogram distribution, with each data point being a symbol in the support.

The work of Hao and Orlitsky [2019c] studies the class of t-histogram distributions and obtains the following upper bound

image

Our contribution is establishing its optimality. Theorem 19. For any  t, n ∈ Z+, there exists a t-histogram distribution p such that

image

Note that uniform distributions correspond to 1-histograms, for which the bounds reduce to ˜Θ(n1/3).

3.1 Multi-Dimensional Profiles

The notion of profile generalizes to the multi-sequence setting.

Let X be a finite or countably infinite alphabet. For every  ⃗n := (n1, . . . , nd) ∈ Ndand tuple x⃗n := (xn11 , . . . , xndd )of sequences in  X ∗, the multiplicity  µy(x⃗n)of a symbol  y ∈ Xis the vector of its frequencies in the tuple of sequences. The profile of  x⃗nis the multiset  ϕ(x⃗n)of multiplicities of the observed symbols Acharya et al. [2010], Das [2012], Charikar et al. [2019b], and its dimension is the number  D(x⃗n)of distinct elements in the multiset. Drawing independent samples from  ⃗p :=(p1, . . . , pd) ∈ ∆dX, the profile entropy is simply the entropy of the joint-sample profile.

Many of the previous results potentially generalize to this multi-dimensional setting. For example, the√2nbound on  D(x⃗n)in the 1-dimensional case becomes

image

and

image

This essentially recovers the√2nbound for d = 1.

3.2 Concluding Remarks

The classical view on the entropy of an i.i.d. sample corresponds to the equation

image

which provides little insight for statistical applications.

This paper presents a different view by decomposing the nH(p) information into three pieces: the labeling of the profile elements, ordering of them, and profile entropy. With no bias towards any symbols and under the i.i.d. assumption, the profile entropy rises as a fundamental measure unifying the concepts of estimation, inference, and compression.

Appendices orgnization In the appendices, we order the results and proofs according to their logical priority. In other words, the proof of a theorem or lemma mainly relies on preceding results. For the ease of reference, the numbering of the theorems is consistent with that in the main paper.

Consider an arbitrary sequence  xnover a finite or countably infinite alphabet X. The multiplicity µy(xn)of a symbol  y ∈ Xis the frequency of y in  xn. The prevalence of an integer  µis the number  ϕµ(xn)of symbols in  xnwith multiplicity  µ. The profile of  xnis the multiset  ϕ(xn)of multiplicities of the symbols in  xn, which we describe as a profile of length n.

Let  ∆Xbe a collection of distributions over X. We say that a distribution collection  P ⊆ ∆Xis label-invariant if for any  p ∈ P, the collection P contains all its symbol-permuted versions. A distribution property over a distribution collection  P ⊆ ∆Xis a functional  f : P → Rthat associates with each distribution in P a real value. For a label-invariant  P ⊆ ∆X, we say that a property f over P is symmetric if it takes the same value for distributions sharing the same probability multiset.

A.1 Profile Dimension and Its Concentration

A profile  φis said to have length n if there exists  xn ∈ X nsatisfying  φ = ϕ(xn). For any multiset S, we define its dimension as the number D(S) of distinct elements in S. Recall that profiles of sequences are multisets. For notational convenience, we write both  D(ϕ(xn))and  D(xn)for the dimension of profile  ϕ(xn).

Viewed as a random variable, the profile of an i.i.d. sample  Xn ∼ phas a distribution depending only on p and n. Hence, we denote by  Φnsuch a profile and write  Φn ∼ p. Analogously, we denote by  Dn := D(Φn)the profile dimension associated with (p, n), and write  Dn ∼ p.

Recall that the multiplicity  µy(xn)denotes the number of times symbol y appearing in  xn. Denote by �the logical OR operator. For any distribution p and  Xn ∼ p, we have

image

The statistical dependency landscape of terms in the summation is rather complex, since  µx(Xn)and  µy(Xn)are dependent for every (x, y) pair due to the fixed sample size; and so are  1µx(Xn)=µ1and  1µx(Xn)=µ2for every pair of distinct  µ1and  µ2. To simplify the derivations, we relate this quantity to its variant under the aforementioned Poisson sampling scheme, i.e., making the sample size an independent  N ∼ Poi(n). Specifically, define

image

Note that this is not the same as  DNsince the summation index goes up only to n. Denote the expected value of ˜DNby  En(p). Our result shows that the original  Dnsatisfies a Chernoff-Hoeffding type bound centered at  En(p).

Theorem 6. Under the above conditions and for any  n ∈ Z+, p ∈ ∆X, and  γ > 0,

image

and for any  γ ∈ (0, 1),

image

Proof. To simplify our analysis, we first consider an alternative model where the sample size is an independent Poisson random variable N with mean n. A nice attribute of Poisson sampling is that all the multiplicities  µy(Xn)are independent of each other. Later, we will relate this model to the fixed-sample-size model and establish our claim rigorously.

For simplicity and clarity, we suppress  Xnin  µy(Xn)and write  νyinstead of  µywhen the multiplicity is obtained through Poisson sampling. For any  i ∈ [n], denote  Gi({νx}x) := �x∈X 1νx=i. Instead of analyzing  DN, we consider

image

Note that for any disjoint  I, J ⊆ [n], the functions �i∈I Gi({νx}x)and  �j∈J Gj({νx}x)are discordant monotone by each argument, namely, when we increase the value of each  νx, the increase in the value of one function implies the non-increase of the other. Then, by the results in Lehmann [1966], the values of the two functions, when viewed as random variables, are negatively associated.

Next we show that quantity ˜DNsatisfies a Chernoff-type bound.

Let  γbe an arbitrary positive number. Note that  Giis a Bernoulli random variable with parameter

image

Then for the expected value of ˜DN, we have

image

For simplicity, write  Y := ˜DNand  µ := En(p). By Markov’s inequality and the monotonicity of function  etyover t > 0,

image

It suffices to bound  E[etY ]by a function of other parameters.

image

where (a) follows by the definition of Y ; (b) follows by  ea+b = ea · eb; (c)follows by the fact that  G1is negatively associated with �ni=2 Gi; (d)follows by an induction argument via negative association; (e) follows by the fact that  Giis a Bernoulli random variable with mean  qi; (f)follows by the inequality  1 + x ≤ ex, ∀x ≥ 0; (g)follows by  ea · eb = ea+b; and (h) follows by  µ = �i qi. Applying standard simplifications, we obtain

image

The proof will be complete upon noting that: 1) the probability that N = n is at least  1/(3√n);

2) conditioning on N = n transforms the sampling model to that with a fixed sample size n.

As a corollary, the value of  Dnis often close to  En(p).

Corollary 2. Under the same conditions as above and for any  n ∈ Z+, p ∈ ∆X, with probability

image

Proof. To establish the lower bound, note that if  En(p) ≥ 3 log n, setting  γ = 1in Theorem 6 yields

image

else if  En(p) < 3 log n, setting  γ = (3 log n)/En(p)yields

image

As for the upper bound, if  En(p) ≥ 8 log n,

image

and for any  En(p) < 8 log n,

image

Combining these tail bounds through the union bound completes the proof.

In addition to the above, we establish an Efron-Stein type inequality.

Theorem 7. For any distribution p and  Dn ∼ p,

image

Proof. First, note that for any  j, t ∈ [n]and  j ̸= t,

image

Therefore, the variance of the profile dimension  Dnsatisfies

image

A.2 Profile Entropy and Its Connection to Dimension

For a distribution  p ∈ ∆Xand sampling parameter n, the profile entropy with respect to (p, n) is the entropy  H(Φn)of the sample profile  Φn ∼ p. By Shannon’s source coding theorem, profile entropy  H(Φn)is the information-theoretic limit of the minimal expected codeword length (MECL) for the lossless compression of the sample profile. Hence, characterizing its value is of fundamental importance. But as one may expect, the distribution of  Φnis sophisticated and over a large alphabet.

More concretely, by the formula of Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies the equation

image

Despite the complex statistical dependency landscape and the exponentially large alphabet size, below we establish that for any distribution and sample size, the profile entropy is often of the same order as the profile size, with high probability. Specifically,

Theorem 1. For any distribution  p ∈ ∆Xand  Φn ∼ p, with probability at least  1 − O(1/√n),

image

We decompose the proof of the theorem into three steps. First, we show that  ⌈H(Φn)⌉ ≤ ˜Θ(D(Φn))with high probability, which is a simple consequence of Shannon’s source coding theorem and Theorem 6 (which shows that  D(Φn)highly concentrates around its expectation). Then, we introduce a simple quantity  HSn (p)that approximates the expectation of  D(Φn)to within logarithmic factors of n. Finally, leveraging this approximation guarantee, we establish the other direction of the theorem. This step is more involved due to the aforementioned complications.

A. Bounding Profile Entropy by Its Dimension

By the tail bounds (Theorem 6) and trivial lower bound of 1 on the profile dimension, with probability at least  1 − O(1/√n), the expectation of  D(Φn)satisfies

image

By the block profile compression algorithm presented in Section 2.4 of the main paper, storing profile  Φn ∼ plosslessly takes

image

nats space in expectation. By Shannon’s source coding theorem, the expected space to losslessly storing a random variable is at least its entropy. Hence, with probability at least  1 − O(1/√n),

image

Again, noting that  D(Φn) ≥ 1completes the proof.

B. Simple Approximation Formula for Profile Dimension

It remains to show that  ⌈H(Φn)⌉ ≥ ˜Ω(D(Φn)), with high probability. To proceed further, we note that  D(Φn) = Dn ∼ pis often close to  En(p), the expectation of its Poissonized version ˜DN, with an exponentially small deviation probability. Hence, to approximate  Dn, it suffices to accurately compute  En(p). By independence and the linearity of expectations,

image

The expression is exact but does not relate to p in a simple manner. For an intuitive approximation, we partition the unit interval into a sequence of ranges,

image

denote by  pIjthe number of probabilities  pxbelonging to  Ij, and relate  En(p)to an induced shape-reflecting quantity,

image

the sum of the effective number of probabilities lying within each range Hao and Orlitsky [2019c]. To compute  HSn (p), we simply count the number of probabilities in each  Ij. Our main result shows that  HSn (p)well approximates  En(p)over the entire  ∆X, up to logarithmic factors of n.

Theorem 8. For any  n ∈ Z+and  p ∈ ∆X,

image

Proof. The fact that  O(HSn (p))upperly bounds  E[ ˜DN]simply follows by the concentration of Pois- son variables, and is established in Hao and Orlitsky [2019c]. Below we show that the quantity also serves as a lower bound. By construction, for any given sampling parameter n, index j, and symbol x with probability  px ∈ Ij, the corresponding symbol multiplicity  µx ∼ Poi(npx). Hence, we can express the expectation of ˜DNas

image

This proves the aforementioned formula. Then, for every sufficiently large index j and  i ∈ Sj :=[(j − 1)2, j2] log n, define a sequence of intervals,

image

Now we analyze the contribution of indices  i ∈ Sjto the expected value of ˜DN. For clarity, we divide our analysis into two cases:  pIj ≥ j log nand  pIj < j log n.

Consider the collection  Pjof probabilities  px ∈ Ij, and the collection  Ijof intervals  Iij, i ∈ Sj. By construction, each probability in  Pjis contained in at least  j√log nmany intervals in  Ij. Hence the total number of probabilities (repeatedly counted) included in  Ijis at least  pIj · j√log n. Note that the number of intervals in  Ijis less than 2j log n. We claim that there exists one (or more) interval  Ii′j ∈ Ijcontaining at least  pIj/(2√log n)probabilities. By construction, there are at least  j√log n/2neighboring intervals of  Ii′jthat contain at least  pIj/(4√log n)probabilities. The contribution of these these intervals to the expected value of ˜DNis at least  j√log n/2times

image

where the last step holds if  pIj ≤ j log n. This yields a lower bound of  Θ(pIj/√log n).

It remains to consider the  pIj > j log ncase. Again, the total number of probabilities included in  Ijis at least  pIj · j√log n. Furthermore, each interval  Iijcontains at most  pIjprobabilities and there are less than 2j log n intervals. Therefore, the number of intervals that contain at least  j√log n/4probabilities is at least  j√log n/2. Otherwise, the number of probabilities included in  Ijis less than

image

which leads to a contradiction. Analogously, the contribution of these these intervals to the expected value of ˜DNis at least  j√log n/2times

image

which yields a lower bound of  Θ(j√log n)on the expected value of ˜DN.

Consolidating the previous results shows that

image

C. Bounding Profile Dimension by Its Entropy

Next, we establish that for any distribution  p ∈ ∆X , Φn ∼ p, with probability at least  1−O(1/√n),

image

Let p be an arbitrary distribution in  ∆X. Recall that we partition the interval (0, 1] into a sequence of sub-intervals,

image

and denote by  pIjthe number of probabilities  pxin  Ij.

Our current objective is to bound  H(Φn ∼ p)from below by a nontrivial multiple of  HSn (p). For simplicity of derivations, we will adopt the standard Poisson sampling scheme and make the sample size an independent Poisson variable  N ∼ Poi(n). For notational simplicity, we will suppress  XNin all the expressions and write the profile as  ϕ := ΦNby slightly abusing the notation.

Note that the profile can be equivalently expressed as a length-n vector

image

where  ϕidenotes the number of symbols appearing exactly i times.

For a sufficiently large absolute constant c, decompose  ϕinto c parts according to  Ijsuch that the t-th part (t = 1, . . . , c) consists of  ϕi’s satisfying  i ∈ nIjwith  j ≡ t mod c. Since by definition,

image

one of the c parts corresponds to a partial sum of at least  HSn (p)/c. Without loss of generality, we assume that it is the second part, i.e.,

image

Apply standard Poisson tail probability bounds, e.g., Lemma 2. Let Y be a Poisson or binomial random variable with mean value  λ. Then,

image

and

image

For any  j ≡ 1 mod cand with probability at least  1 − 1/n4, one can express the truncated profile (ϕi)i∈nIjover  Ijas a function of  µxfor x satisfying  npx ∈ Ij′, j′ ∈ (j − c/2, j + c/2).

Basically, this says that for every x, the number of its appearance is not too far away from the expected value. By the union bound, this is true for all  j ≡ 1 mod cwith probability at least 1 − 1/n3, as j can take only n possible values. Denote the last event by A.

To proceed, we recall the formula of Hardy and Ramanujan [1918] on the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles:

image

Below, we will use a weaker version that works for any n:

image

Then, conditioning on A, the truncated profiles  (ϕi)i∈nIjfor  j ≡ 1 mod care independent. Since conditioning reduces entropy,

image

where the third last step follows by

image

the second last follows by  H(X) ≤ log kfor any X with a support size of k, and the fact that there are at most  exp(3√m)many profiles of length m, as we explained above; and the last step follows by the elementary inequality

image

Our new objective is to bound  H((ϕi)i∈nIj)from below. We will find a sub-interval  Isjof  Ijand bound  H((ϕi)i∈nIsj )in the rest of the section, since

image

For all  j ≡ 1 mod c, our lower bound is simply

image

Henceforth, we assume that j is sufficiently large and denote  Lj := j√log n.

For any j and every integer  i ∈ Sj := [(j − 1)2, j2] log n, define a sequence of intervals,

image

On the other hand, the following upper bound holds.

image

In other words, for any  px, i/n ∈ Ijthat differ by at most  Lj/n,

image

Partition  Ijinto sub-intervals of equal length  Lj/n. The partition has a size of at most  2√log n. Assign each probability  px ∈ Ija length-Lj/ninterval  Ipxcentered at  px. Then, each interval  Ipxcovers at least one of the sub-intervals in the partition. Since there are exactly  pIjintervals  Ipx, one can find a partition sub-interval  Isjcontained in at least  pIj/(2√log n)of them. Denote by  Xsthe collection of symbols corresponding to these intervals.

Next, we bound from below the entropy of the truncated profile  (ϕi)i∈nIsjover  nIsj. Denote by  jsthe left end point of  nIsj. By the chain rule of entropy for multiple random variables,

image

Consider a particular term on the right-hand side with  i ∈ [js, js + Lj − 1]. By the conditional independence and fact that conditioning reduces entropy,

image

To characterize the condition, we define a random variable

image

Note that  E[1js≤µx≤i−1] = �i−1t=js Pr(Poi(npx) = t) ≤ (i − js)/(2Lj), which is at most 1/10 for i ≤ js + Lj/5. The following lemma transforms this into a high-probability statement.

Lemma 3. Let  Yi, i ∈ [1, m]be independent indicator random variables. Let  Y := �i Yidenote their sum and  λ := E[Y ]denote the expected sum. Then for c > 0, we have

image

Below we consider only  i ≤ js + Lj/5. Note that c/(2 + 2c/3) is increasing for c > 0.

image

where we set c = 4 in the above lemma and assume that  |Xs| ≥ 3(assuming only  |Xs| ≥ 1, the upper bound becomes 3/4). Recall that

image

Denote by  Vs ⊆ {0, 1}Xthe collection of  (cx)x∈Xsatisfying �x∈Xs cx < |Xs|/2. The above derivation shows that

image

By independence, for any  (cx)x∈X ∈ Vs, we have

image

For any  x ∈ Xswith  cx = 0, the corresponding indicator variable satisfies

image

On the other hand, for any  x ̸∈ Xs,

image

Therefore, the corresponding indicator variable satisfies

image

To summarize, we have shown that  (ϕi|1js≤µx≤i−1 = cx, x ∈ Xs)is the sum of |X| independent Bernoulli random variables. Among these Bernoulli variables, at least  |Xs|/2 ≥ pIj/(2√log n)have a bias of 1Lj� 19, 59�, while others have a bias of at most 59 · 1Lj.

The following lemma, recently established by Hillion et al. [2019], shows the relation among the entropy values of sums of independent Bernoulli random variables with different bias parameters.

Lemma 4. Let  Xt, Yt, t ∈ [m]be independent indicator random variables. Denote by X and Y the sums of  Xt’s and  Yt’s, respectively. If  E[Xt] ≤ E[Yt] ≤ 1/2, ∀t ∈ [m],

image

This lemma, together with the previous results, shows that

image

Proof. By definition, the left-hand side satisfies

image

By Stirling’s formula, for any  t ≥ 1,

image

Substituting the right-hand side into the above equation yields

image

Let g(x) := 0 for  x ∈ [0, 1)and g(x) := (x + 1/2) log x for  x ≥ 1. Simple calculus shows that the function is concave. Applying the concavity of g to the last sum yields

image

where the last step follows by the fact that  mq ≥ 1. A similar inequality holds for the weighted sum of  log(m − t)!. Consolidating these inequalities, we obtain

image

On the other hand, for the log m! term,

image

Substituting the previous term bounds into the H(bin(m, q)) expression yields

image

Before continuing, we remark that the bound in the above lemma has the right dependence on mq(1−q)in the sense that if we fix q and increase m, the lower bound converges to 12 log(Θ(mq(1−q))). Another point to mention is that the above bound covers  q ∈ [1/m, 1 − 1/m], while Lemma 6 appearing later in this section covers  q ̸∈ [1/m, 1 − 1/m]. Note that the dependence on  mq(1 − q)changes from logarithmic to linear, showing an “elbow effect” around 1/m.

Assume that  pIj/(2√log n) ≥ 9Lj, then for any  (cx)x∈X ∈ Vs,

image

Consolidating this with the previous results yields that

image

where we utilize  pIj/(2√log n) ≥ 9Lj ≥ 9and  (1 − q)m + qm < 1/efor  ∀m ≥ 3, q ∈ [1/m, 1/2]. We can then bound the quantity of interest as follows.

image

On the other hand, if  9Lj ≥ pIj/(2√log n) ≫ 1, we can further “compress” the truncated profile (ϕi)i∈nIsjover  nIsjto reduce the effective value of  Lj. Specifically, for any integer  t < Lj, we define the t-compressed version of  (ϕi)i∈nIsjas

image

Note that for each t, the length of  (ϕi)ti∈nIsjis  Ltj := Lj/t. For each entry in the compressed version, we can again express the entry as the sum of independent indicator random variables. Specifically,

image

Furthermore, for any  x ∈ Xs, the expectation of each indicator variable satisfies

image

Similarly, for any  x ∈ X, we have  E[1µx∈[js+(ℓ−1)t,js+ℓt−1]] ≤ 1/(2Ltj).

Now, choose t large enough so that  18Ltj ≥ pIj/(2√log n) ≥ 9Ltj. Following the reasoning in the previous case shows that

image

It remains to consider the case of  O(√log n) ≥ pIj ≥ 1, for which we adopt our previous analysis.

Again, partition  Ijinto sub-intervals of equal length  Lj/n. Then, assign each probability  px ∈ Ija length-Lj/ninterval  Ipxcentered at  px. By construction, each interval  Ipxcovers at least one of the sub-intervals in the partition. Redefine any of these covered sub-intervals as  Isj. Denote by  Xsthe collection of symbols corresponding to the covering intervals.

Note that  O(√log n) ≥ pIj ≥ |Xs| ≥ 1. For any  i ∈ [js, js+Lj/5], the previous analysis shows that

image

Proof. By symmetry, we need to consider only the case of  q ∈ [0, 1/m].

image

Consolidating the lemma and the chain rule of entropy yields,

image

Alternatively, we can use the fact that adding independent random variables does not decrease entropy, i.e.,  H(Y + Z) ≥ H(Y )for any independent variables Y and Z. Note that

image

Let y be an arbitrary symbol that belongs to  Xs. Then,

image

By the previous derivations, both  Pr(µy = js)and  Pr(µy = js + 1)belong to 1Lj [1/9, 1/2]. Hence,

image

Note that this argument does not apply to other cases, since

image

while  min�pIj, j · log n�can be as large as ˜Θ(n1/3)in general.

The proof is complete upon noting that indices with j = O(1) corresponds to a total contribution of at most O(1) to  HSn (p)and  HSn (p) = ˜Θ(E[D(ϕ)]) = ˜Θ(D(ϕ)), with probability at least  1 −O(1/√n).

Summary The simple expression shows that  HSn (p)characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,  HSn (p)closely approximates  En(p), the value around which  Dn ∼ pconcentrates (Theorem 6) and  H(Φn)lies (Thoerem 1). Henceforth, we use  HSn (p)as a proxy for both  H(Φn)and  Dn, and study its attributes and values.

A.3 Symmetric Property Estimation and Sufficiency of Profiles

The rest of Section A shows that the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes. For clarity, we divide the full proof into three parts: a) the sufficiency of profiles for estimating symmetric properties; b) the standard “median trick” often used to boost the confidence of learning algorithms; c) the PML method and its competitiveness to the min-max estimators. The proof utilizes several previously established results.

Sufficiency of profiles We first show that profile-based estimators are sufficient for estimating symmetric properties. Recall that a distribution collection  P ⊆ ∆Xis label-invariant if for any p ∈ P, the collection P contains all its symbol-permuted versions. Then,

Theorem 2. Let f be a symmetric functional over a label-invariant distribution collection  P ⊆ ∆X. For any accuracy  ε > 0and tolerance  δ ∈ (0, 1), if there exists an estimator ˆfsuch that

image

there is an estimator ˆfϕover  Φsatisfying

image

Note that both estimators can have independent randomness.

Proof. First we show that given estimator ˆf, there is an estimator ˆfswhich is symmetric, i.e., invariant with respect to domain-symbol permutations, and achieves the same guarantee. To see this, consider a random permutation  ˜σchosen uniformly randomly from the collection of permutations over the underlying alphabet. Let ˆfs := ˆf ◦ ˜σ. Then for any  p ∈ P,

image

where (a) follows by the definition of ˆfs; (b)follows by the law of total probability; (c) follows by the independence between  ˜σand  Xn; (d)follows by the symmetry of f and the equivalence of applying  σto  Xnand to p; (e) follows by the fact that  σ(p) ∈ Pand the guarantee satisfied by the estimator ˆf; and (f) follows by the law of total probability.

Before we proceed further, we introduce the following definitions. For any sequence  xn, the sketch of a symbol x in  xnis the set of indices  i ∈ [n]for which  xi = x. The type of a sequence  xnis the set  τ(xn)of sketches of symbols appearing in  xn.

Since ˆfsis symmetric, there exists a mapping ˆfτover types satisfying ˆfs = ˆfτ ◦ τ. Due to the i.i.d. assumption on the sample generation process, given the profile of a sample sequence, all the different types corresponding to this profile are equally likely. Let  Λbe a mapping that recovers this relation, i.e.,  Λmaps each profile uniformly randomly to a type having this profile.

Then, for any  p ∈ Pand  Xn ∼ p,

image

Consequently, the mapping ˆfϕ := ˆfτ ◦ Λis a profile-based estimator that satisfies

image

A.4 Median Trick

The following argument is standard and often used to boost the confidence of learning algorithms.

Lemma 7 (Median trick). Let  α, β ∈ (0, 1)be real parameters satisfying  1/10 ≥ α > β. For an accuracy  ε > 0and a distribution set  P ⊆ ∆X, if there exists an estimator ˆfAsuch that

image

we can construct another estimator ˆfBthat takes a sample of size  m :=� 4nlog12α log 1β�and achieves

image

Proof. Given  t ∈ Ni.i.d. copies of ˆfA(Xn), the probability that less than half of them satisfy the inequality in the parentheses is at least

image

By the law of total probability, the right-hand side equals to

image

where the first step follows by the Chernoff bound of binomial random variables, and the second step follows by  α ≤ 1/10and the inequality  c − 1 − c2 log c > 0, ∀c ≥ 5.

Set  t :=� 4log 12α log 1β�, the right-hand side is at least  1 − β.

Therefore, given a sample of size  m = t·n, we can partition it into t sub-samples of equal size, apply the estimator ˆfAto each subsample, and define the median of the corresponding estimates as ˆfB.

By the previous reasoning, this estimator satisfies

image

A.5 Profile Maximum Likelihood and Its Adaptiveness

For every profile  φof length n and distribution collection  P ⊆ ∆X, the profile maximum likelihood (PML) estimator Orlitsky et al. [2004] over P maps  φto a distribution

image

that maximizes the probability of observing the profile  φ.

For any property f, let  εf(n, δ, P)denote the smallest error that can be achieved by any estimator with a sample size n and tolerance  δon the error probability. This definition is equivalent to that of the sample complexity. In the following, we show that the PML estimator is adaptive to the simplicity of underlying distributions in inferring all symmetric properties, over any label-invariant P.

For brevity, set  δ = 1/10and suppress both  δand P in  εf(n, δ, P).

Theorem 3 (Adaptiveness of PML). Let f be a symmetric property and  P ⊆ ∆Xbe a label-invariant distribution collection. For any  p ∈ Pand  Φn ∼ p, with probability at least  1−O(1/√n),

image

Proof. For any tolerance  δ ∈ (0, 1)and distribution  p ∈ ∆X, define the  (δ, n)-typical cardinality of profiles with respect to p as the smallest cardinality  Cδ,n(p)of a set of length-n profiles such that the probability of observing a sample from p with a profile in this set is at least  1 − δ. The following lemma provides a tight characterization of  Cδ,n(p)in terms of the dimension of  Φn ∼ p.

Lemma 8. For any  p ∈ ∆Xand  Φn ∼ p, with probability at least  1 − 6/√n,

image

The proof of the lemma follows by recursively applying Theorem 6. Specifically, let  d := 2En(p) +3 log n, which is at least  Dn ∼ p, with probability at least  1 − 6/√n. Then,

image

where the last inequality holds with with probability at least  1 − 6/√n.

Now let f be a symmetric functional over P. According to Theorem 2, for any parameters  ε > 0and  δ ∈ (0, 1), if there exists an estimator ˆfsuch that

image

there is an estimator ˆfϕover  Φsatisfying

image

For an arbitrary length-n profile  φthat satisfies  PrΦn∼p(Φn = φ) ≥ 2δ, these error bounds yield

image

and since  PrΦn∼Pφ(Φn = φ) ≥ PrΦn∼p(Φn = φ) ≥ 2δby the definition of PML,

image

By the union bound and triangle inequality,

image

which basically upperly bounds the probability that

image

Next we will assume that there exists an estimator ˆfsatisfying

image

By Lemma 7, if  δ ≤ 1/10, we can construct another estimator ˆf ′that takes a sample of size  n =4mlog12δ log 1δ′ (nis assumed to be an integer here) and achieves a higher-confidence guarantee

image

Then by the above reasoning, with probability at least  1 − 6/√n,

image

For the first term on the right hand side to vanish as quickly as  1/√n, it suffices to have

image

Simplify the expressions and apply the union bound. It suffices to have both

image

If n satisfies these conditions, with probability at least  1 − Θ(1/√n),

image

Summary We have shown the following result, which is a strengthened version of Theorem 3. The result shows that the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes.

If there exists an estimator ˆfsuch that

image

for any  p ∈ Pand  Φn ∼ pwhere the sample size n satisfies both

image

with probability at least  1 − Θ(1/√n),

image

Alternative statement

Fix P and assume that  δ ≤ 1/10. Recall that  εf(n, δ)denotes the smallest error that can be achieved by the best estimator using a size-n sample with a  (1 − δ)-confidence guarantee. Below we provide an alternative statement of the above result, which is more compact in its form.

Then, draw a profile  Φn ∼ p. With probability at least  1 − Θ(1/√n),

image

Consolidating this with Theorem 1 yields that

image

Setting  δ = 1/10and suppressing it in expressions, we establish the desired guarantee: For any p ∈ Pand  Φn ∼ p, with probability at least  1 − O(1/√n),

image

Below are some comments in order.

1. The theorem holds for any symmetric properties, while nearly all previous works require the property to possess certain forms and be smooth.

2. The theorem trivially implies a weaker result in Acharya et al. [2017] where  ⌈H(Φn)⌉is replaced by  O(√n), an upper bound due to the formula of Hardy and Ramanujan [1918].

3. There is a polynomial-time approximation algorithm Charikar et al. [2019a] achieving the same guarantee as that stated in the theorem.

A.6 PML and Sorted Distribution Estimation

The arguments below basically follow by Hao and Orlitsky [2019a] and Hao and Orlitsky [2019d], in which we assume that |X| = O(n log n).

Let f be a function in  L1, the collection of Lipschitz functions over [0, 1]. Without loss of generality, we also assume that f(0) = 0. Let  η ∈ (0, 1)be a threshold parameter to be determined later. An η-truncation of f is a function

image

One can easily verify that  fη ∈ L1. Next, we find a finite subset of  L1so that the  η-truncation of any  f ∈ L1is close to at least one of the functions in this subset.

For an integer parameter s > 3 to be chosen later. Partition the interval  [0, η]into s disjoint sub-intervals of equal length, and define the sequence of end points as  zj := η · j/s, j ∈ ⌈s⌋where ⌈s⌋ := {0, 1, . . ., s}. Then, for each  j ∈ ⌈s⌋, we find the integer  j′such that  |fη(zj) − zj′|is minimized and denote it by  j∗. Since  fηis 1-Lipschitz, we must have  |j∗| ∈ ⌈j⌋. Finally, we connect the points  Zj := (zj, zj∗)sequentially. This curve is continuous and corresponds to a particular η-truncation ˜fη ∈ L1, which we refer to as the discretized  η-truncation of f. Intuitively, we have constructed an  (s+1)×(s+1)grid and “discretized” function f by finding its closest approximation

image

Therefore, for any  p ∈ P := ∆X, the corresponding properties of  fηand ˜fηsatisfy |fη(p) − ˜fη(p)| ≤ |X| · ηs,

where we slightly abuse the notation and write ˜fη(p) := �x ˜fη(px). Note that  |j∗| ∈ ⌈j⌋for all j ∈ ⌈s⌋, and ˜fη(z) = zs∗for  z ≥ η. While there are infinitely many  η-truncations, the cardinality of the discretized  η-truncations of functions in  L1is at most

image

Now we consider the task of estimating ˜fη(p)from  Xn ∼ p. By construction, the real function ˜fη(z)is a constant for  z ≥ η. In addition, the function is Lipschitz and hence for an absolute constant C and an arbitrary interval  I := [a, b] ⊆ [0, 1], one can construct an explicit polynomial g(z) of degree at most  d ∈ N, satisfying

image

Combining these facts with Theorem 5 in Hao and Orlitsky [2019d] shows that there exists an estimator ˆfη(Xn)that for all  p ∈ ∆Xand  εsatisfying  n = Ω(|X|/(ε2 log |X|))and  ε > 1/n,

image

where  λ ∈ (0, 1)is an absolute constant bounded away from 0 whose value can be arbitrarily small.

Then, by setting  η ≤ O(ε2/n1/2+λ), where the asymptotic notation hides a sufficiently small absolute constant, the right-hand side is at most  exp(−4√n). Then, Theorem 2 in this paper and Theorem 3 in Acharya et al. [2017] imply that the PML distribution  Pϕ(Xn)satisfies

image

Consider any  p ∈ ∆Xand  Xn ∼ pwith a profile  ϕ := ϕ(Xn). Consolidate the previous results, and apply the union bound and triangle inequality. With probability at least  1−exp(3s log s − √n), the PML plug-in estimator will satisfy

image

for all functions f in  L1.

Next we consider the “second part” of a function  f ∈ L1, namely,

image

Again, we can verify that ¯fγ ∈ L1. To establish the corresponding guarantees, we make use of the following result. Since the profile probability is invariant to symbol permutation, for our purpose, we can assume that  p(y) ≤ p(z)iff  Pϕ(x) ≤ Pϕ(y), for all  x, y ∈ X. Under this assumption, the following lemma relates  Pϕto p.

Lemma 9. For any  η ∈ (0, O(1/√n)), distribution  p ∈ ∆Xand sample  Xn ∼ pwith profile  ϕ,

image

The proof of this lemma follows from 1) the fact that empirical distribution satisfies such a guarantee with the probability bound being  exp(−4√n), where we ignore labelings and sort the empirical probabilities according to p; 2) the Cauchy-Schwarz inequality applied to bound the expected error of the empirical estimator and McDiarmid’s inequality used to bound the error probability; 3) a variant of Theorem 3 in Acharya et al. [2017] that addresses distribution estimation Das [2012].

Hence, with probability at least  1 − exp(−√n),

image

for all functions f in  L1.

Consolidate the previous results. By the triangle inequality and the union bound, with probability at least  1 − exp (3s log s − √n) − exp(−√n),

image

for all functions f in  L1. Now we can conclude that  ℓ<1 (p, pϕ)is also at most the error bound on the right-hand side. The reason is straightforward: Since with high probability, the above guarantee holds for all functions in  L1, it must also hold for the function that achieves the supremum in

image

It remains to balance the error bounds on the estimation and deviation probability. Recall that we assume  |X| ≤ O(n log n)since otherwise the theorem is trivial to prove. Set  s = ˜Θ(√n)such that 3s log s < √n/2. Then, the confidence lower bound becomes  1 − exp(√n/2) − exp(√n), and the deviation bound reduces to ˜O(√nη) + Θ(�1/(ηn)) + 2ε. The previous derivations also require that  η ≤ O(ε2/n1/2+λ)and  η ∈ (0, O(1/√n)). Setting  η = Θ(1/n3/4)yields the desired result.

A.7 PML and Uniformity Testing

For a finite domain X, denote by  uXthe uniform distribution over X. Given an error parameter  ε > 0and a sample  Xnfrom an unknown distribution  p ∈ ∆X, uniformity testing Goldreich and Ron [2011] aims to distinguish between the null hypothesis

image

and the alternative hypothesis

image

In the work of Hao and Orlitsky [2019a], it is shown that the following simple PML-based algorithm achieves the optimal  Θ(�|X|/ε2)sample complexity Paninski [2008] for uniformity testing, up to logarithmic factors of the alphabet size |X|. Note that we instantiate the distribution collection P as ∆X, and use 0 and 1 to indicate whether  H0or  H1is accepted.

image

Figure 1: Uniformity tester  TPML

In this section, we present another intriguing connection between the PML estimator and the uniformity testing problem. For any profile  ϕof length n, denote

image

Then for any accuracy  ε > 0, the following uniformity tester Diakonikolas et al. [2016a] is sample optimal up to logarithmic factors.

image

The following lemma connects the above algorithm to the PML. Lemma 10. Chan et al. [2015] For any profile  ϕ := ϕ(xn)that corresponds to a non-constant sequence  xn ∈ X ∗,

image

Let  p ∈ ∆Xbe an arbitrary discrete distribution. Recall that in Section A, we partition the unit interval into a sequence of ranges,

image

denote by  pIjthe number of probabilities  pxbelonging to  Ij, and relate  En(p)to an induced shape-reflecting quantity,

image

the sum of the effective number of probabilities lying within each range.

The simple expression of  HSn (p)shows that it characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,  HSn (p)closely approximates  En(p), the value around which  Dn ∼ pconcentrates (Theorem 6) and  H(Φn)lies (Thoerem 1). In this section, we use  HSn (p)as a proxy for both  H(Φn)and  Dn, and study its attributes and values.

To further our understanding of profile entropy and dimension, we investigate the analytical attributes of  HSn (p)concerning monotonicity and Lipschitzness. Then, we present tight upper and lower bounds on the value of  HSn (p)for a variety of distribution families.

B.1 Monotonicity

Among the many attributes that  HSn (p)possesses, monotonicity is perhaps most intuitive. One may expect a larger value of  HSn (p)as the sample size n increases, since additional observations reveal more information on the variability of probabilities. Below we confirm this intuition.

Theorem 9. For any  n ≥ m ≫ 1and  p ∈ ∆X,

image

Besides the above result that lowerly bounds  HSn (p)with  HSm(p)for  m ≤ n, a more desirable result is to upperly bound  HSn (p)with a function of  HSm(p). Such a result will enable us to draw a sample of size  m ≤ n, obtain an estimate of  HSm(p)from  Dm, and use it to bound the value of  HSn (p)and thus of  Dnfor a much larger sample size n.

With such an estimate, we can perform numerous tasks such as predicting the performance of PML in estimating symmetric properties when more observations are available, and the space needed for storing a longer sample profile. These applications are closely related to the recent works on learnability estimation Kong and Valiant [2018], Kong et al. [2019].

The next theorem provides a simple and tight upper bound on  HSn (p)in terms of  HSm(p). Theorem 10. For any  n ≥ m ≫ 1and  p ∈ ∆X,

image

Implications Before proceeding to the proof, we first present two simple implications.

image

In other words, the sequence  Am := HSm(p)/√m log m, m ≤ n, is monotonically decreasing and converges to  An. As we increase the value of  m, (√n log n · Am), which can be viewed as our estimate of  HSn (p), is getting more and more accurate. For the purpose of adaptive estimation, if  n = 2t, we can choose the sequence  m = 20, 21, . . . , 2t.

Proof. For clarity, we denote by p(m, j) the value of  pIjcorresponding to  HSm(p), and p(n, j) the value of  pIjcorresponding to  HSn (p). Furthermore, denote  r :=�(n/m)((log m)/ log n), which we assume is an integer. Then by the definition of  HS·,

image

The lower-bound part basically follows by reversing the above inequalities.

image

This completes the proof of the theorem.

B.2 Lipschitzness

Viewing  HSn (p)as a distribution property, we establish its Lipschitzness with respect to a weighted Hamming distance and the  ℓ1distance. Given two distributions  p, q ∈ ∆X, the vanilla Hamming distance is denoted by

image

The distance is suitable for being a statistical distance since there may be many symbols at which the two distributions differ, yet those symbols account for only a negligible total probability and has little effects on many induced statistics. To address this, we propose a weighted Hamming distance

image

The next result measures the Lipschitzness of  HSnunder  hW.

Theorem 11. For any integer n, and distributions p and q, if  hW(p, q) ≤ εfor some  ε ≥ 1/n, ��HSn (p) − HSn (q)�� ≤ ˜O(√εn).

Proof. Recall that the quantity of interest is

image

Given the bound of  hW(p, q) ≤ ε, we denote by  Y ⊆ Xthe collection of symbols x at which px ̸= qx. By definition, we have both �x∈Y px ≤ εand  �x∈Y qx ≤ ε. Below, we show that these symbols modify the value of  HSn (p)by at most  ˜O(√εn). By symmetry, the same claim also holds for the distribution q. Combined, these two claims yields the desired result.

First, we consider  x ∈ Ysatisfying  px = 0or  px ∈ I1 = (0, (log n)/n]. Such a symbol either does not contribute the value of  HSn (p), or affects only the value of the first term  min {pI1, log n}, which is at most log n. Hence the claim holds for this case.

Next, consider symbols  x ∈ Ysatisfying  px ∈ Ij = ((j − 1)2 log nn , j2 log nn ]for some  j ≥ 2and denote the collection of them by  Z ⊆ Y. By the above assumption, we have �x∈Z px ≤ ε. To maximize their impact on  HSn (p)under this constraint, we should set their values to be

image

for some J to be determined, where each  pjrepeats exactly j log n times. Then, the symbols in Z contributes at most �Jj=2 j log n = (log n)(J − 1)(J + 2)/2to  HSn (p), and the above constraint on the total probability mass bounds transforms to

image

Therefore in this case, the contribution is again ˜O(√εn), which completes the proof.

Replacing  max{px, qx}with  |px − qx|results in a common similarity measure – the  ℓ1distance. The next theorem is an analog to the above under this classical distance. Theorem 12. For any integer n, and distributions p and q, if  ℓ1(p, q) ≤ εfor some  ε ≥ 0,

image

where c is a constant in [1/3, 3]. Note that the inequality is significant iff  ε ≤ ˜Θ(1/n1/4), since the value of  HSn (p)is at most  O(√n log n)for all p.

By symmetry, it suffices to prove the following lemma. Lemma 11. For any integer n, and distributions p and q, if  ℓ1(p, q) ≤ εfor some  ε ≥ 0,

image

Proof. Consider the task of modifying p by at most  εand maximizing the increase in  HSn (p). For each j and each probability  px ∈ j, denote by  p′xthe modified value. Depending on the location of  p′x, there are three types of possible modifications as illustrated below.

For the first type, we still have  p′x ∈ Ij. This does not change the value of  pIjand hence does not increase  HSn (p).

For the second type, we have  p′x ∈ Ij−1or  p′x ∈ Ij+1. If  pIj ≤ j · log n, this will decrease the value of  min{pIj, j ·log n}by 1 and increase the value of  min{pIj−1, (j −1)·log n}or min{pIj+1, (j + 1)·log n}by at most one. Hence in this case, the value of  HSn (p)can only decrease. If  pIj > j · log n, then  min{pIj, j · log n} = j · log n. For a particular j, all such modifications can increase the value of  HSn (p)by at most  (j − 1) log n + (j + 1) log n =2j log n, which is twice the value of  min{pIj, j · log n}. Hence, all such modifications, when combined, increase the value of  HSn (p)by at most  2HSn (p).

For the third type, we have  p′x ∈ Iiand  |i − j| ≥ 2. If i < j, we require a probability massof at least  ((j − 1)2 log n − i2 log n)/n ≥ (i log n)/n, where  j ≥ 3. If i > j, we require a probability mass of at least  ((i − 1)2 log n − j2 log n)/n ≥ (i log n)/n. The number of such modifications that could lead to an increase in the value of  HSn (p)is at most i log n. For each i, let  cidenote the number of such modifications that will lead to an increase of HSn (p). Then, the total increase is  �i ci, each  ciis at most i log n, and the total required probability mass required is at least �i ci · (i log n)/n ≤ ε.

Let  {ci}be the optimal solution that maximizes �i ci. Assume that there are two indices i < j satisfying  ci < i log nand  cj > 0. Then, if we replace  ciand  cjby  ci + 1and  cj − 1, respectively, �i ciwill not change and  �i ci · (i log n)/nwill decrease. Hence, we can assume that there exists  i′satisfying  ci = i log n, ∀i < i′and  ci = 0, ∀i > i′. In addition, assuming  εn ≥ log nimplies that  i′ ≥ 2. Hence, we have �i ci ≤ (log n)i′(i′ + 1)/2and

image

Profile Entropy for Structured Families

Following the study of attributes of profile entropy, we derive below nearly tight bounds on the HSn (p)values of three important structured families, log-concave, power-law, and histogram. These bounds tighten up and significantly improve those in Hao and Orlitsky [2019c], and show the ability of profile entropy in charactering natural shape constraints.

For the remaining sections, we follow the convention and specify structured distributions over X = Z.

B.3 Log-Concave Distributions

We say a discrete distribution  p ∈ ∆Zis log-concave if p has a contiguous support over Z and the inequality  p2x ≥ px−1px+1holds for all symbols  x ∈ Z.

The log-concave family encompasses a broad range of discrete distributions, such as Poisson, hyperPoisson, Poisson binomial, binomial, negative binomial, geometric, and hyper-geometric, with wide applications to numerous research areas, including statistics Saumard and Wellner [2014], computer science Lovász and Vempala [2007], economics An [1997], algebra, and geometry Stanley [1989].

The next result upperly bounds the profile entropy of log-concave families, and is tight up to logarithmic factors of n. Theorem 13. For any  n ∈ Zand distribution  p ∈ ∆Z, if p is log-concave and has a variance of  σ2,

image

Proof. The  O(log n)(1 + σ)upper bound is established in Hao and Orlitsky [2019c] using some concentration attributes of the log-concave distributions.

For the other component, we can assume that  σ ≥ √nand n is larger than some absolute constant. Then by Diakonikolas et al. [2016b], the maximum probability  pmaxof p belongs to  [1/(8σ), 1/σ]. Hence, the last index J for which  pIJ ̸= 0satisfies

image

Hence, we have

image

This upper bound is uniformly better than the  min{σ, (n2/σ)1/3}bound in Hao and Orlitsky [2019c]. Theorem 17 further shows that it is optimal up to logarithmic factors of n.

A similar bound holds for t-mixtures of log-concave distributions. More concretely,

Theorem 14. For any integer n and distribution  p ∈ ∆Z, if p is a t-mixture of log-concave distributions each has a variance of  σ2i, where i = 1, . . . , t,

image

B.4 Discretization of Continuous Distributions

The introduction about log-concave families covers numerous classical discrete distributions, yet leaves many more continuous ones untouched Bagnoli and Bergstrom [2005]. Below, we present a discretization procedure that preserves distribution shapes such as monotonicity, modality, and log-concavity. Applying this procedure to the Gaussian distribution  N(µ, σ2)further shows the optimality of Theorem 13.

Let X be a continuous random variable with density function f(x). For any  x ∈ R, denote by  ⌈x⌋the closest integer z such that  x ∈ (z −1/2, z+1/2]. The distribution of  ⌈X⌋is over Z and satisfies

image

We refer to this random variable  ⌈X⌋as the discretized version of X.

Shape Preservation By definition, one can readily verify that the above transformation preserves several important shape characteristics of distributions, such as monotonicity, modality, and k-modality (possibly yields a smaller k). The following theorem covers log-concavity.

Theorem 15. For any continuous random variable X over R with a log-concave density f, the distribution  p ∈ ∆Zassociated with  ⌈X⌋is also log-concave.

To show this, we need the following basic lemma about concave functions. Lemma 12. For real numbers  x1, x2, y1,and  y2satisfying  x1 ≤ x2, y1 ≤ y2, x1 < y1, and x2 < y2,

image

Following the lemma, for  x, y ∈ Rsuch that  |x − y| ≤ 1, and any function f that is log-concave,

image

Proof. By definition p is log-concave if p has a consecutive support and  p(z)2 ≥ p(z+1)p(z−1), ∀z. For  ⌈X⌋, the first condition is satisfied since X is has a continuous support on R, and p(z) is positive as long as f(x) > 0 for a non-empty sub-interval of  (z − 1/2, z + 1/2].

Below we show that p satisfies the second condition. Specifically, for any  z ∈ Z,

image

where the inequality follows by the above lemma and its implication.

Moment preservation Denote by p the distribution of  ⌈X⌋for  X ∼ f. Let  µand  σ2be the mean and variance of density f, given that they exist. The theorem below shows that distribution p has, within small additive absolute constants, a mean of  µand variance of  Θ(σ2). Theorem 16. Under the aforementioned conditions, the mean of  ⌈X⌋satisfies

image

and the variance of  ⌈X⌋satisfies

image

Proof. First consider the mean value of  ⌈X⌋for  X ∼ f. We have

image

Next consider the variance of  ⌈X⌋. Applying the inequality  (a + b)2 ≤ 2(a2 + b2)yields

image

By the symmetry in the above reasoning, we also have

image

Consolidating these inequalities shows that

image

B.5 Optimality of Theorem 13

By the above formula, the discretized Gaussian  ⌈N⌋(µ, σ2)has a distribution in the form of

image

Consolidating Theorem 15 and 16 shows that  pGis a log-concave distribution with a variance of Θ(σ2) ± 1. Consequently, Theorem 13 yields the following upper bound:

image

The optimality of Theorem 13 follows by these inequalities.

Proof. At it is clear from the context, we will write p instead of  pG. Recall that

image

where  pIjdenotes the number of probabilities belonging to  Ij = ((j − 1)2, j2] · (log n)/n. Considering part of the distribution can only reduce the value of  HSn (p). Hence, we focus on the symbols in the range  (µ + 1, ∞) ∩ Z, over which the probability mass function p(z) is monotone.

We will further assume that  n/ log n ≫ σ ≫ log n, since otherwise the right-hand side of the inequality reduces to O(1), and the result follows by the fact that  HSn (p) ≥ 1for all n and p.

In addition, we focus on  j ≫ 1in the following argument, as the contribution to  HSn (p)from these indices is no more than the total  HSn (p).

Given these assumptions, we have

image

where  c(σ, n) := log�n/(√2πσ log n)�and the interval is well-defined iff

image

For clarity, we divide our analysis into two cases: √n ≥ σ ≫ log nand  n/ log n ≫ σ > √n.

For the first case and  j ≤�σ/ log n/2 ≤�n/(σ log n)/2, the length  Ljof the above interval, which equals to  pIjup to an additive slack of 2, satisfies

image

Therefore, we have  Lj = Ω(σ/(j log n)). Since  σ ≫ log nensures  Lj ≥ 3and  j ≤�σ/ log n/2is equivalent to  σ ≥ 4j2 log n, the lower bound on  Ljtransforms into  pIj ≥ Ω(j). Hence in this case,  HSn (p)admits the following bound

image

In the  n/ log n ≫ σ > √ncase, we have�σ/ log n >�n/(σ log n). Repeating the previous reasoning for  j ≤�n/(σ log n)/2, we again obtain  Lj = Ω (σ/(j log n))and  pIj ≥ Ω(j).

Therefore,

image

Finally, note that in the first case,  min{σ, n/σ} = σ, while in the second,  min{σ, n/σ} = n/σ.

Consolidating these results yields the desired lower bound

image

B.6 Power-Law Distributions

We say that a discrete distribution  p ∈ ∆Zis a power-law with power  α > 0if p has a support of [k] := {1, . . ., k} for some  k ∈ Z+ ∪ {∞}and  px ∝ x−αfor all  x ∈ [k].

Power-law is a ubiquitous structure appearing in many situations of scientific interest, ranging from natural phenomena such as the initial mass function of stars Kroupa [2001], species and genera Humphries et al. [2010], rainfall Machado and Rossow [1993], population dynamics Taylor [1961], and brain surface electric potential Miller et al. [2009], to man-made circumstances such as the word frequencies in a text Baayen [2002], income rankings Dr˘agulescu and Yakovenko [2001], company sizes Axtell [2001], and internet topology Faloutsos et al. [1999].

Unlike log-concave distributions that concentrate around their mean values, power-laws are known to possess “long-tails” and always log-convex. Hence, one may expect the profile entropy of power-law distributions to behave differently from that of log-concave ones. The next theorem justifies this intuition and provides tight upper bounds.

Theorem 18. For a power-law distribution  p ∈ ∆[k]with power  α, we have

image

where

image

The above upper bound fully characterizes the profile entropy of power-laws and surpasses the basic {k, √n log n}bound for both  k ≫ √nand  k ≪ √n. In comparison, Hao and Orlitsky [2019c] yields a  O(nmin{1/(1+α),1/2})upper bound, which improves over √n log nonly for  α > 1and is worse than that above for all  α < 1+1/log k.

Proof. For the ease of exposition, write the probability of symbol i assigned by distribution p as pi := c−1α · i−α, where  cαis a normalizing constant (implicitly depends on k) and k can be infinite. Recall that the quantity of interest is

image

First consider  pIjfor a sufficiently large j and note that

image

where  c(α, n) := (cα log n)/n. Observe that the length  Ljof interval  I′j, which differs from the value of  pIjby at most 2, is proportional to  (j −1)−2/α −j−2/α, and hence is a decreasing function of j. Furthermore, each term  min{pIj, j · log n} ≈ min{Lj, j · log n}is basically the minimum between this decreasing function and j log n, an increasing function of j. This naturally calls for determining the value of j at which the two functions are equal. Concretely,

image

Let J denote the middle quantity on the right-hand side (implicitly depends on  αand n). We can decompose the summation  HSn (p)into two parts. The first part consists of indices  j ≤ J,

image

Correspondingly, the second part consists of indices  j ≥ J. For these indices j, we have  Lj ≤j · log n. Recall that  I′jspecifies the range of i satisfying  pi ∈ Ij. Then the second part satisfies

image

where the first inequality follows by the fact that the intervals  I′jare consecutive. Also note that the boundary case j = J is covered in both  HSn,1and  HSn,2under different conditions. Depending on the value of the normalizing constant  cα, the following implications are immediate and apply to all power parameters  α > 0. If  cα ≤ ˜O(n)α1/(1+α),

image

These bounds can be refined given the knowledge of k. Note that for  α ≤ 1, the support size k must be finite for the normalizing constant to be well-defined. On the other hand, for  α > 1, it is common to assume  k = ∞, which we adopt here.

Then, for  α > 1, the above bound derivations yield

image

where we lower bound  cαby  max{1, 1/(α − 1)}.

Next, we improve the upper bound for  α < 1. Note that the normalizing constant admits

image

Then for  k ≥ √n, the previous upper bound yields

image

where we utilize the inequality  ((k + 1)1−α − 1)/(1 − α) ≥ (k + 1)1−α/e. Furthermore, one can bound  HSn (p)by k since it is at most the sum of  pIj. Combined, these two results yield

image

To complete the picture, we consider the case of  α = 1. Note that  cα = �ki=1 i−1 > log k. Hence for  α = 1, the above reasoning implies that

image

Finally, we also analyze the case of  α > 1with truncation at k. Our analysis mainly relies on the following inequality that provides a tight lower bound on  cα.

image

To summarize, for different values of  α, the quantity  HSn (p)admits

image

where

image

Note that unless  α < 1and  k ≈ √n, all the bounds are better than  Θ(√n). In addition, the derivation above already shows that the bounds are tight up to logarithmic factors. Reorganizing the terms yields the desired result: For any power-law distribution p with power  α ≥ 0,

image

As a remark, for  α = 0, the distribution becomes a uniform distribution with support size k. The above result also covers this case, for which the upper bound simplifies to  k ∧ (n/k).

B.7 Histogram Distributions

A distribution  p ∈ ∆Xis a t-histogram distribution if there is a partition of X into t parts such that p has the same probability value over all symbols in each part.

Besides the long line of research on histograms reviewed in Ioannidis [2003], the importance of histogram distributions rises with the rapid growth of data sizes in numerous engineering and science applications in the modern era.

For example, in scenarios where processing the complete data set is inefficient or even impossible, a standard solution is to partition/cluster the data into groups according to the task specifications and element similarities, and randomly sample from each group to obtain a subset of the data to use. This naturally induces a histogram distribution, with each data point being a symbol in the support.

The work of Hao and Orlitsky [2019c] studies the class of t-histogram distributions and obtains the following upper bound

image

Our contribution is establishing its optimality. Theorem 19. For any  t, n ∈ Z+, there exists a t-histogram distribution p such that

image

Note that uniform distributions correspond to 1-histograms, for which the bounds reduce to ˜Θ(n1/3).

Proof. Again, recall that the quantity of interest is

image

Our construction depends on the value of t as follows. Let  A · {B}denote the length-A constant sequence of value B. If t = 1, then the distribution p has the following form

image

where  p0is a properly chosen probability in  In1/3so that p is well-defined, and the range of support of distribution p is irrelevant for our purpose and hence unspecified. If  2 ≤ t < n1/4/(2√log n), then for some parameter  s ≥ 0to be determined, the distribution p has the following form

image

where the probability values are sorted according to the ordering they appear above, and L is a properly chosen to make the probabilities sum to 1. For the distribution to be well-defined, we require

image

where the last inequality is valid given that  t < n1/4/(2√log n). Let s be the maximum integer satisfying the inequality above. Then, the quantity  HSn (p)admits the lower bound

image

Finally, if  t ≥ n0 := n1/4/(2√log n), then the distribution p has the following form

image

where  p0is a properly chosen to make the probabilities sum to 1. By the reasoning for the last case, the distribution p is well-defined. In addition, the quantity  HSn (p)satisfies

image

Consolidating these results yields the desired lower bound.

C.1 Competitive Distribution Estimation

Competitive estimation calls for an estimator that competes with the instance-by-instance performance of a genie knowing more information, but reasonably restricted. Denote by  ℓKL(p, q)the KL divergence. Introduced in Orlitsky and Suresh [2015], the formulation considers the collection N of all natural estimators, and shows that a simple variant  ˆpGTof the Good-Turing estimator achieves

image

for every distribution p and with high probability. We refer to the left-hand side as the excess loss of estimator  ˆpGTwith respect to the best natural estimator, and note that it vanishes at a rate independent of p. For a more involved estimator in Acharya et al. [2013], the excess loss vanishes at a faster rate of ˜O(min{1/√n, |X|/n}), which is optimal up to logarithmic factors for every estimator and the respective worst-case distribution. For the  ℓ1distance, the work of Valiant and Valiant [2016] derives a similar result.

These estimators track the loss of the best natural estimator for each distribution. Yet an equally important component, the excess loss bound, is still of the worst-case nature. For a fully adaptive guarantee, Hao and Orlitsky [2019c] design an estimator  ˆp⋆that achieves a  Dn/nexcess loss, i.e.,

image

for every p and  Xn ∼ p, with high probability. Utilizing the adaptiveness of  Dnto the simplicity of distributions, the paper derives excess-loss bounds for several important distribution families, and proves the estimator’s optimality under various of classical and modern learning frameworks.

New results While the work of Hao and Orlitsky [2019c] provides an appealing upper bound on the excess loss, it is not exactly clear how good this bound is as a matching lower bound is missing. In this work, we complete the picture by showing that the  Dn/nbound is essential for competitive estimation and optimal up to logarithmic factors of n. Theorem 5 (Minimal excess loss). For any  n, D ∈ Nand distribution estimator  ˆp′, there is a distribution p such that with probability at least 9/10, we have both

image

and

image

According to Theorem 1, we can replace  Dnby multiples ˜Θ(H(Φn))of the profile entropy in both the upper and lower bounds.

image

where  D ≲�n/ log nfor the total to be at most n. Let  A · {B}denote the length-A constant sequence of value B. Let C be the set of distributions in the form of

image

where the probability values are sorted according to the ordering they appear above, L is a proper variable that makes the probabilities sum to 1, and the range of support of distribution p is irrelevant for our purpose and hence unspecified. Equip a uniform prior over C (equivalently, construct a random distribution). We have several claims in order:

image

For any  i ∈ Iand  µ ∈ Ui, the value of  Mµ, the total probability of symbols appearing  µtimes, is  qiif  ϕµ = 1and  qiis chosen; and is  q′iif  ϕµ = 1and  qiis chosen. Any estimator Eµwill incur an expected absolute error of  Ω(i(log n)/n)in estimating  Mµgiven  ϕµ = 1.

Note that for any α  ∈ [0, 1]and x, y > 0, α(y − z)2 + (1 − α)(z − x)2 ≥ α(1 − α)(x − y)2.

Therefore, the expected squared Hellinger distance  H2(·, ·)of any estimator  Eµin estimating  (Mµ)µ≥0satisfies

image

image

C.2 Competitive Entropy Estimation

The next theorm shows that for every distribution and among all plug-in entropy estimators, the distribution estimator in Hao and Orlitsky [2019c] is as good as the one that performs best in estimating the actual distribution.

Denote by N the collection of all natural estimators. Write  |H(p) − H(q)|as  ℓH(p, q)for compactness and the KL-divergence between  p, q ∈ ∆Xas  ℓKL(p, q).

Theorem 4 (Competitive entropy estimation). For any distribution p, sample  Xn ∼ pwith profile Φn := ϕ(Xn), and  ˆpNXn := arg minˆp∈N ℓKL(p, ˆpXn ), we have

image

with probability at least  1 − O(1/n).

Proof. Given any natural estimator and a sample  Xn ∼ p, we denote by q the distribution estimate. The entropy of q differs from the true entropy by

image

Denote by  Pµ(Xn)and  Qµ(Xn)the total probability that distributions p and q assign to symbols with multiplicity  µ. Since q is induced by a natural estimator, we also write  qµ(Xn)for the probability that q assigns to each symbol with multiplicity  µin  Xn. Recall that prevalence  ϕµ(Xn)denotes the number of symbols with multiplicity  µin  Xn. Therefore,  Qµ(Xn) = ϕµ(Xn) · qµ(Xn). Henceforth, whenever it is clear from the context, we suppress  Xnin related expressions. Then, the second term on the right-hand side satisfies

image

Let  qminbe the smallest nonzero probability of q. By the triangle inequality and Pinsker’s inequality,

image

By definition,  ˆpNXn = arg minˆp∈N ℓKL(p, ˆpXn ). Now we show that if a symbol x has multiplicity  µ, the estimator  ˆpNwill assign a probability mass of  Pµ/ϕµ. In other words, ˆP Nµ = Pµsince  pN ∈ N. Indeed, the corresponding KL-divergence values differ by

image

Then, the above equalities yield that,

image

Next consider the other estimator  ˆp⋆, which is also natural. Let  Dn = D(Φn)be the profile dimension of  Xn. By the results in Hao and Orlitsky [2019c], estimator  ˆp⋆achieves a  Dn/nexcess loss, i.e.,

image

for every p and  Xn ∼ p, with probability at least  1 − O(1/n). In addition, by its construction, the minimum probability  ˆp⋆min(Xn)is at least  1/n4. Therefore, with probability at least  1 − O(1/n),

image

Finally, the triangle inequality combines the above results and yields

image

This together with Theorem 1 completes the proof.

While a labeled sample contains all information, for many modern applications, such as property estimation and differential privacy, it is sufficient Orlitsky et al. [2004] or even necessary to provide only the profile Suresh [2019]. Hence, this section focuses on the lossless compression of profiles.

For any distribution p, it is well-known that the minimal expected codeword length (MECL) for losslessly compressing a sample  Xn ∼ pis approximately nH(p), which increases linearly in n as long as H(p) is bounded away from zero.

On the other hand, by the Hardy-Ramanujan formula Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies

image

Consequently, the MECL for losslessly compressing the sample profile  Φn ∼ pis at most  O(√n), a number potentially much smaller than nH(p).

By Shannon’s source coding theorem, the profile entropy  H(Φn)is the information-theoretic limit of MECL for the lossless compression of profile  Φn ∼ p. Below, we present explicit block and sequential profile compression schemes achieving this entropy limit, up to logarithmic factors of n.

D.1 Block Compression

The block compression algorithm we propose is intuitive and easy to implement.

Recall that the profile of a sequence  xnis the multiset  ϕ(xn)of multiplicities associated with symbols in  xn. The ordering of elements in a multiset is not informative. Hence equivalently, we can compress  ϕ(xn)into the set  C(ϕ(xn))of corresponding multiplicity-prevalence pairs, i.e.,

image

The number of pairs in  C(ϕ(xn))is equal to the profile dimension  D(ϕ(xn)). In addition, both a prevalence and its multiplicity are integers in [0, n], and storing the pair takes 2 log n nats. Hence, it takes at most  2(log n) · D(ϕ(xn))nats to store the compressed profile. By Theorem 1, for any distribution  p ∈ ∆Xand  Φn ∼ p,

image

D.2 Sequential Compression

For any sequence  xn, the setting for sequential profile compression is that at time step  t ∈ [n], the compression algorithm knows only  ϕ(xt)and sequentially encodes the new information. This is equivalent to providing the algorithm  µxt(xt−1)at time step t.

Suppress  x, xtin the expressions for the ease of illustration. For efficient compression, we sequentially encode the profile  ϕinto a self-balancing binary search tree T , with each node storing a multiplicity-prevalence pair  (µ, ϕµ)and  µbeing the search key. We present the algorithm details as follows.

image

The algorithm runs for exactly n iterations, with a O(log n) per-iteration time complexity. For an i.i.d. sample  Xn ∼ p, the expected space complexity is again ˜Θ(⌈H(Φn)⌉).

E.1 Multi-Dimensional Profiles

As we elaborate below, the notion of profile generalizes to the multi-sequence setting.

Let X be a finite or countably infinite alphabet. For every  ⃗n := (n1, . . . , nd) ∈ Ndand tuple x⃗n := (xn11 , . . . , xndd )of sequences in  X ∗, the multiplicity  µy(x⃗n)of a symbol  y ∈ Xis the vector of its frequencies in the tuple of sequences. The profile of  x⃗nis the multiset  ϕ(x⃗n)of multiplicities of the observed symbols Acharya et al. [2010], Das [2012], Charikar et al. [2019b], and its dimension is the number  D(x⃗n)of distinct elements in the multiset. Drawing independent samples from  ⃗p :=(p1, . . . , pd) ∈ ∆dX, the profile entropy is simply the entropy of the joint-sample profile.

Many of the previous results potentially generalize to this multi-dimensional setting. For example, the√2nbound on  D(x⃗n)in the 1-dimensional case becomes Theorem 20. For any  X, ⃗n, and  x⃗n ∈ X ⃗n, there exists a positive integer r such that

image

and

image

This essentially recovers the√2nbound for d = 1.

Proof. For simplicity, we suppress  x⃗nin  D(x⃗n). Let  ∆ddenote the standard d-dimensional simplex. As each multiplicity corresponds to a vector in  Nd, in the ideal case, the profile that has the maximum dimension D corresponds to the integer vectors in the scaled simplex  (r · ∆d), for some properly chosen parameter r. For the minimum value of such a parameter  r ∈ Z+, we have

image

and

image

Consolidating these two inequalities yields the desired result.

E.2 Discrete Multi-Variate Gaussian

Given a mean vector  µ ∈ Zdand covariance matrix  Σ ∈ Rd×dwith eigenvalues at least 1, the corresponding discrete d-dimensional Gaussian is specified by its probability mass function

image

where C > 0 is a normalizing constant. Note that definition is slightly different from that induced by the discretization procedure presented in Section B.4. The reason for adopting this definition (which is also standard in literature) is to simplify the subsequent reasoning. Let  σ21 ≤ σ22 . . . ≤ σ2dbe the d eigenvalues of  Σ, where  σ21 ≥ 1by assumption. In this section, we show that for  d ≥ 9,

image

where  αΣ := exp�6σ2d/σ21�and  βd,n :=�(2 log n)/d, and  γdis a constant that depends only on d, which appears in Lemma 14. Note that the above bound resembles that for univariate log-concave distributions (Theorem 13). This result is not included in the main paper due to the different setting. Lower bound on C First we bound the value of C from below in terms of its eigenvalues and other parameters. By symmetry, we can decompose the matrix  Σas

image

where  Λis a diagonal matrix with  Λii = σ2i, and V is an orthonormal matrix whose i-th column is the eigenvector  viassociated with  σ2i.

Partition the real space  Rdinto unit cubes whose vertices belong to  Zd. For any two vectors  ˜a,˜b ∈Rdthat belong to the same unit cube, we want to bound the ratio between  p(˜a)and  p(˜b). Denote a := ˜a − µand  b := ˜b − µ, and express a and b as linear combinations of eigenvectors,

image

The log-ratio between the corresponding probabilities satisfies

image

Note that �i(xi − yi)2 = ∥a − b∥22 = �i(˜ai − ˜bi)2 ≤ dsince  ˜a − ˜b = a − band  ˜a,˜bbelong to the same unit cube. Hence, we bound the absolute value of the ratio by

image

Now, consider the hyper-ellipse E induced by

image

For any  x ∈ E, simple algebra shows that  ∥x − µ∥22 ≤ dσ2d. Hence by the previous discussion, for any unit cube U with vertices in  Zd, there exists a vertex  vUof U such that for any  x ∈ U ∩ E,

image

Note that  x ∈ Eis equivalent to  p(x) ≥ exp(−d/2)/C. The probability mass over E is at least

image

image

Consolidating the lower and upper bounds and multiplying both sides by C yield

image

where the first implication follows by the lemma below.

image

Upper bound We proceed to bound  HSn (p) = �j≥1 min�pIj, j · log n�.

Below we assume that C < n/ log n, since otherwise  p(x) ≤ (log n)/n, ∀x, yielding an O(log n) upper bound on  HSn (p). Then by definition, the last index j such that  pIj > 0satisfies

image

Denote by J the quantity on the right-hand side. Then,

image

Furthermore, by a reasoning similar to that above, the collection of points  x ∈ Zdsatisfying  p(x) ≤1/(Cn) = p(µ)/n ≤ 1/ncontributes at most O(log n) to  HSn (p). Hence we need to analyze only points x satisfying p(x) > 1/(Cn). Equivalently, points in

image

Clearly, these points contribute at most  |E⋆|to the sum. Noting that  E⋆is a discrete hyper-ellipse, we can bound its cardinality via the following lemma Bentkus and Götze [1997].

Lemma 14. Let  µ ∈ Rdbe a mean vector, and  Σ ∈ Rd×dbe a real covariance matrix with nonzero eigenvalues  σ21 ≤ . . . σ2d. For any  d ≥ 9and  t ≥ σ2d, the discrete ellipsoid

image

admits the following inequality on its cardinality,

image

where  γd > 1is a constant that depends only on d.

For simplicity, write  αΣ := exp�6σ2d/σ21�and  βd,n :=�(2 log n)/d. Applying the above lemma to bound  |E⋆|(where t = 2 log n) and combining the result with our lower bound on C yield

image

where the second step follows by Lemma 13.

image

Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Shengjun Pan, and Narayana P Santhanam. Clas- sification using pattern probability estimators. In 2010 IEEE International Symposium on Information Theory, pages 1493–1497. IEEE, 2010.

Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In Proceedings of the 24th Annual Conference on Learning Theory, pages 47–68, 2011.

Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Suresh. Competitive classification and closeness testing. In Conference on Learning Theory, pages 22–1, 2012.

Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal prob- ability estimation with applications to prediction and classification. In Conference on Learning Theory, pages 764–796, 2013.

Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.

Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, 95(3), 1997.

Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Luis Flores, Jose Antonio Lozano, Yves Van de Peer, Rosa Blanco, Víctor Robles, Concha Bielza, et al. A review of estimation of distribution algorithms in bioinformatics. BioData mining, 1(1):6, 2008.

Robert L Axtell. Zipf distribution of US firm sizes. science, 293(5536):1818–1820, 2001.

R Harald Baayen. Word frequency distributions, volume 18. Springer Science & Business Media, 2002.

Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic theory, 26(2):445–469, 2005.

Alexander Barvinok. Computing the permanent of (some) complex matrices. Foundations of Computational Mathematics, 16(2):329–342, 2016.

Alexander I Barvinok. Two algorithmic results for the traveling salesman problem. Mathematics of Operations Research, 21(1):65–84, 1996.

Vidmantas Bentkus and Friedrich Götze. On the lattice point problem for ellipsoids. Acta Arithmetica, 80(2):101–125, 1997.

Peter Bühlmann, Petros Drineas, Michael Kane, and Mark van der Laan. Learning structured distri- butions Ilias Diakonikolas. In Handbook of Big Data, pages 283–300. Chapman and Hall/CRC, 2016.

Chun Lam Chan, Winston Fernandes, Navin Kashyap, and Manjunath Krishnapur. Phase transitions for the uniform distribution in the PML problem and its Bethe approximation. arXiv preprint arXiv:1506.00753, 2015.

Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Bethe approximation for structured matrices: an improved approximation for the profile maximum likelihood. In NeurIPS 2019 Workshop on Information Theory and Machine Learning, 2019a.

Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 780–791, 2019b.

Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.

Hirakendu Das. Competitive tests and estimators for properties of distributions. PhD thesis, UC San Diego, 2012.

Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. arXiv preprint arXiv:1611.03579, 2016a.

Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Efficient robust proper learning of log- concave distributions. arXiv preprint arXiv:1606.03077, 2016b.

Adrian Dr˘agulescu and Victor M Yakovenko. Exponential and power-law probability distributions of wealth and income in the United Kingdom and the United States. Physica A: Statistical Mechanics and its Applications, 299(1-2):213–221, 2001.

Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM computer communication review, 29(4):251–262, 1999.

Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68–75. Springer, 2011.

Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.

Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems, pages 10989–11001, 2019a.

Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019b.

Yi Hao and Alon Orlitsky. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pages 2614–2623, 2019c.

Yi Hao and Alon Orlitsky. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems, pages 11104–11114, 2019d.

Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8834–8843, 2018.

Godfrey H Hardy and Srinivasa Ramanujan. Asymptotic formulaæ in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75–115, 1918.

Jean Hausser and Korbinian Strimmer. Entropy inference and the James-Stein estimator, with appli- cation to nonlinear gene association networks. Journal of Machine Learning Research, 10(Jul): 1469–1484, 2009.

Erwan Hillion, Oliver Johnson, et al. A proof of the Shepp–Olkin entropy monotonicity conjecture. Electronic Journal of Probability, 24, 2019.

Nicolas E Humphries, Nuno Queiroz, Jennifer RM Dyer, Nicolas G Pade, Michael K Musyl, Kurt M Schaefer, Daniel W Fuller, Juerg M Brunnschweiler, Thomas K Doyle, Jonathan DR Houghton, et al. Environmental context explains Lévy and Brownian movement patterns of marine predators. Nature, 465(7301):1066–1069, 2010.

Yannis Ioannidis. The history of histograms (abridged). In Proceedings 2003 VLDB Conference, pages 19–30. Elsevier, 2003.

Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.

Weihao Kong and Gregory Valiant. Estimating learnability in the sublinear data regime. In Advances in Neural Information Processing Systems, pages 5455–5464, 2018.

Weihao Kong, Gregory Valiant, and Emma Brunskill. Sublinear optimal policy value estimation in contextual bandits. arXiv preprint arXiv:1912.06111, 2019.

Pavel Kroupa. On the variation of the initial mass function. Monthly Notices of the Royal Astronomical Society, 322(2):231–246, 2001.

Erich Leo Lehmann. Some concepts of dependence. The Annals of Mathematical Statistics, pages 1137–1153, 1966.

László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algo- rithms. Random Structures & Algorithms, 30(3):307–358, 2007.

LAT Machado and WB Rossow. Structural characteristics and radiative properties of tropical cloud clusters. Monthly Weather Review, 121(12):3234–3260, 1993.

Anne E Magurran. Measuring biological diversity. John Wiley & Sons, 2013.

Kai J Miller, Larry B Sorensen, Jeffrey G Ojemann, and Marcel Den Nijs. Power-law scaling in the brain surface electric potential. PLoS computational biology, 5(12), 2009.

Michael Mitzenmacher. A brief history of generative models for power law and lognormal distribu- tions. Internet mathematics, 1(2):226–251, 2004.

Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is Good- Turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.

Alon Orlitsky, Narayana P Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 426–435. AUAI Press, 2004.

Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014.

Thomas Schürmann and Peter Grassberger. Entropy estimation of symbol sequences. Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414–427, 1996.

Richard P Stanley. Log-concave and unimodal sequences in algebra, combinatorics, and geometry. Ann. New York Acad. Sci, 576(1):500–535, 1989.

Ananda Theertha Suresh. Differentially private anonymized histograms. In Advances in Neural Information Processing Systems, pages 7969–7979, 2019.

Lionel Roy Taylor. Aggregation, variance and the mean. Nature, 189(4766):732–735, 1961.

Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 685–694, 2011.

Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 142–155, 2016.

Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

Sergio Verdú. Empirical estimation of information measures: A literature guide. Entropy, 21(8): 720, 2019.

Pascal O Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In 2012 IEEE International Symposium on Information Theory Proceedings. IEEE, 2012.

Pascal O Vontobel. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant-Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2014.


Designed for Accessibility and to further Open Science