Recent research in statistical machine learning, ranging from neural-network training and online learning, to density estimation and property testing, has advanced evaluation criteria beyond worst-case analysis. New performance measures apply more refined metrics relating the algorithm’s accuracy and efficiency to the problem’s inherent structure.
Consider for example learning an unknown discrete distribution from its i.i.d. samples. Classical worst-case analysis states that in the worst case, the number of samples required to estimate a distribution to a given KL-divergence grows linearly in the alphabet size.
However, this formulation is pessimistic. Distributions are rarely the worst possible, and many practical distributions can be estimated with significantly smaller samples. Furthermore, once the sample is drawn, it reveals the distribution’s complexity and hence the hardness of the learning task.
Going beyond worst-case analysis, we design an adaptive learning algorithm whose theoretical guarantees vary according to the problem’s simplicity. For example, Orlitsky and Suresh [2015] recently proposed an estimator that instance-by-instance achieves nearly the same performance as a genie algorithm designed with prior knowledge of the underlying distribution.
We introduce profile entropy, a fundamental measure for the complexity of discrete distributions, and show that it connects three vital scientific tasks: estimation, inference, and compression. The resulting algorithms have guarantees directly relating to the data profile entropy, hence also adapt to the intrinsic simplicity of the tasks at hand.
The next subsections, formalize the relevant concepts and present relevant prior works.
1.1 Sample Profiles and Their Entropy
Consider an arbitrary sequence over a finite or countably infinite alphabet X. The multiplicity
of a symbol
is the number of times y appears in
. The prevalence of an integer
is the number
of symbols in
with multiplicity
. The profile of
is the multiset
of multiplicities of the symbols in
. We refer to it as a profile of length n.
The number D(S) of distinct elements in a multiset S is its dimension. For convenience, we also write for profile dimension. Note that the dimension of a length-n profile is at most
.
Let be the collection of distributions over X, and p be an arbitrary distribution in
. The profile
of an i.i.d. sample
is a random variable whose distribution depends on only p and n. We therefore write
, and call
the profile entropy with respect to (p, n). Analogously, we call
, the profile dimension associated with (p, n) and write
. Due to the dependence among multiplicities, the distributions of
and
are rather complex in general. To obtain clean expressions, we can adopt the standard Poisson sampling technique and make the sample size a Poisson variable
, independent of the sample. As an example,
where denotes the probability of symbol x assigned by p. Note that sometimes we also write p(x) instead of
for notational convenience. Despite the complex landscape of statistical dependency, in Theorem 1, we show that
and
are of the same order, with high probability and for every
. In Theorem 6 and 7, we show that
highly concentrates around a variant of its expectation. In Section 2.5, we provide a much simpler quantity
that well approximates the expectation variant of
. Combined, these three results provide a precise characterization of both profile entropy and dimension. Leveraging this in Section B.2, we derive nearly-tight bounds on the magnitude of profile entropy for several important structural distribution families, including log-concave and power-law.
1.2 Applications and Prior Works
In this section, we present several learning and compression applications in which the profile entropy would play an important role. We also review related prior works with an emphasize on adaptive algorithms, and suppress the discussions on the worst-case analysis for brevity.
Basics and Significance
The profile of a sample corresponds to the empirical distribution of symbols, and reflects the magnitudes of the actual symbol probabilities. Hence, the profile dimension, the number of distinct symbol frequencies, characterizes the variability of ranges the probabilities spread over. The sample profile’s entropy, which by Theorem 1 is of the same order as its dimension, admits the same interpretation.
Intuitively, samples from simple distributions tend to have low profile entropy, such as those from a m-piecewise distribution with small m, or one whose probability masses concentrate over some sparse set. The profile entropy is also likely to decrease as one reduces the sample size, since the sample contains less information regarding the variability of distribution probabilities. See Theorem 9, 14, and 19 for a formal justification of these arguments.
From a statistical perspective, the profile of a sample is a sufficient statistic for estimating the probability multiset and any symmetric functional of the underlying distribution, such as entropy and support size. When we express a profile as a collection of multiplicity-prevalence pairs, the profile dimension is the size of this collection. Being of the same magnitude, the profile entropy is thus the effective size of a natural sufficient statistic for label-invariant inference.
Profile entropy also directly connects to adaptive testing and classification. Such connection arises from computing the profile probability Acharya et al. [2011, 2012], the probability of observing a sample with the given profile. If the profile has entropy of H, we can show that this computation problem has a time complexity of . The result follows by the equivalence of the problem and computing the permanent of a rank-
matrix Barvinok [1996], Vontobel [2012, 2014], Barvinok [2016].
Below, we introduce two important applications that are more involved, in which the profile entropy has essential connection to the statistical efficacy of adaptive learning.
Distribution Estimation
Estimating unknown distributions from their samples is a statistical-inference cornerstone, and has numerous applications, ranging from biological studies Armañanzas et al. [2008] to language modeling Chen and Goodman [1999].
A learning algorithm in this setting is often referred to as a distribution estimator, which is a functional associating with every sequence
over X a distribution
. Given a sample
, we measure the performance of
in estimating the (unknown) distribution p with a loss function
, e.g., the
distance and KL divergence.
A classical worst-case type result shows that any estimator that achieves a small loss of
over
in expectation requires a sample size of
. Recent research further shows that the naive empirical-distribution estimator attains the optimal sample efficiency, to the right constants.
The desire to design more efficient estimators for practical distributions such as Poisson mixtures leads to two adaptive estimation frameworks: structural and competitive.
Structural estimation focuses on distributions possessing a natural structure, such as monotonicity, m-modality, and log-concavity. In many cases including the mentioned, structural assumptions lead to effective estimators that provably perform better on the corresponding distribution classes. See Bühlmann et al. [2016] for a review of recent literature.
Competitive estimation aims to design estimators that are universally near-optimal. Without strong structural knowledge, a reasonable estimator should naturally assign the same probability to symbols appearing equal number of times. The objective here is to find an estimator that learns every distribution as well as the best natural estimator designed with knowledge of the true distribution. Discussion continues in Section 2.3 with a review of relevant works.
Property (Functional) Inference
Instead of recovering the underlying distribution, numerous practical applications require only inferring a particular property value, such as entropy for graphical modeling Koller and Friedman [2009], and support size for species richness estimation Magurran [2013].
Formally, a distribution property over a distribution collection is a functional
that associates with each distribution in P a real value. Given a sample
from an unknown distribution
, the problem of interest is to infer the value of f(p). To do this, we employee another functional
, a property estimator that maps every sample to a real value.
The statistical efficiency of in estimating f with respect to the distribution collection P is measured by its sample complexity. Specifically, for an accuracy
and error tolerance
, the
-sample complexity of
with respect to (f, P) is the minimal sample size n for which
for all
. Note that for the special case of P = {p}, the sample complexity directly characterizes the ability of
in estimating f(p).
Recent years have shown interests in determining the sample complexities of inferring distribution properties. Built upon worst-case analysis, the major contribution of these works is establishing the sufficiency of sample sizes sub-linear in |X|. As an example, in the vital sample-sparse regime and over , the
-sample complexity of learning entropy is
. We refer the readers to Verdú [2019] for a thorough survey of related works.
As the problem involves two components, the property and distribution, adaptive analysis also advances in two veins.
The first vein concerns constructing a universal plug-in estimator for all symmetric properties. A symmetric property is invariant under symbol permutations, hence it suffices to obtain an accurate estimate of the probability multiset. Recently, following the works of Das [2012], Acharya et al. [2017], Hao and Orlitsky [2019a] show that for any symmetric property that is additively separable and appropriately Lipschitz, the profile maximum likelihood (Section 2.2) achieves the optimal sample complexity up to small constant factors. Other major works include Valiant and Valiant [2011, 2013, 2016], Han et al. [2018], Charikar et al. [2019b]
The second vein is an analogy to the competitive distribution estimation framework, and aims to compete with the instance-by-instance performance of a genie having access to more information, but reasonably restricted. A natural choice for the genie is the best-known and most-used – the empirical estimator that evaluates the property at the sample empirical distribution. To empower the genie, we grant it access to a sample whose size is logarithmically larger than that available to the learner. One can show that this enables the genie to universally achieve the optimal sample complexities for numerous properties and hypothesis classes P. Under this formulation, Hao et al. [2018], Hao and Orlitsky [2019b] provide a unified learning algorithm that achieves the optimal competitiveness guarantees in near-linear time.
In this work, we further both veins of works and show that: 1) the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes; 2) when plugged into entropy, the estimator in Hao and Orlitsky [2019c] approximates the property as well as the plug-in estimator whose distribution component is the best natural, for every distribution. See Theorem 3 and 4 for the formal statements.
We establish essential connections between profile entropy and the estimation of distributions, inference of their properties, and compression of profiles. To further our understanding of profile entropy, we then investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.
For space considerations, we relegate most technical proofs to the appendices.
Permutation invariance By definition, both the profile of a sequence and its dimension are invariant to domain-symbol permutations. Since entropy is a symmetric property, the profile entropy of an i.i.d. sample is also permutation invariant. Consequently, a result in this section that holds for a distribution will also hold for any distributions sharing the same probability multiset.
This is desirable for practical applications, since samples often come as categorical data, while the symbol ordering under which the underlying distribution would exhibit certain structure is unknown to the learner. For example, in natural language processing, we observe words and punctuation marks. Given that the data comes from a power-law distribution Mitzenmacher [2004], we often don’t know how to order the alphabet to realize such a condition.
Surprisingly, with a few exceptions such as Hao and Orlitsky [2019c], most previous works on learning structured discrete distributions do not address this crucial matter in their learning algorithms. The existing results are rather artificial and more like learning discretized continuous distributions. See Section B.2 for our discussion on distribution discretization.
2.1 Profile Dimension and Entropy
where the notation hides logarithmic factors of n.
The theorem shows that for every distribution and sampling parameter n, the induced profile entropy and profile dimension are of the same order, with high probability.
Taking expectation and noting that yield
2.2 Adaptive Property Estimation
Definitions A profile is said to have length n if there exists
satisfying
. For every profile
of length n and distribution collection
, the profile maximum likelihood (PML) estimator Orlitsky et al. [2004] over P maps
to a distribution
that maximizes the probability of observing the profile . For any property f, let
denote the smallest error that can be achieved by any estimator with a sample size n and tolerance
on the error probability. This definition is equivalent to that of the sample complexity. Below, we assume that P is label invariant, i.e., for any
, collection P contains all its symbol-permuted versions.
We first show that profile-based estimators are sufficient for estimating symmetric properties. Theorem 2 (Sufficiency of profiles). Let f be a symmetric property over P. For any accuracy and tolerance
, if there exists an estimator
such that
there is an estimator over length-n profiles satisfying
Note that both estimators can have independent randomness.
The second result shows that the PML estimator is adaptive to the simplicity of underlying distributions in inferring all symmetric properties, over any label-invariant P. For clarity, we set and suppress both
and P in
.
Theorem 3 (Adaptiveness of PML). Let f be a symmetric property. For any and
, with probability at least
,
Some comments: 1) The theorem holds for any symmetric properties, while nearly all previous works require the property to possess certain forms and be smooth; 2) The theorem trivially implies a weaker result in Acharya et al. [2017] where is replaced by
; 3) There is a polynomial-time approximation Charikar et al. [2019a] achieving the same guarantee; 4) We provide a stronger result in Section A.5 of the appendices for general
.
Besides this theorem, we establish in Section A.6 and A.7 two additional results on PML. The first result addresses sorted distribution estimation, and improves over that established in Hao and Orlitsky [2019a] (Theorem 5) in terms of the lower bound on the accuracy parameter . Let
denote the collection of symbol permutations over X. For any
, denote by
the permuted distribution satisfying
for all symbols
. Let
be a positive absolute constant that can be made arbitrarily small, e.g.,
.
Lemma 1. For any , and
, if we have
and
, with probability at least
,
A few comments in order: 1) The polynomial-time computable variant of PML in Charikar et al. [2019a] satisfies the same guarantee, and the proof for this is also similar to that in Section A.6; 2) Using the existing efficiently computable PML-type methods Charikar et al. [2019a,b], the best possible lower bound on is
; 3) Below the
threshold, the empirical distribution estimator is sample optimal up to constant factors Han et al. [2018].
The second result shows an intriguing connection between the PML method and the task of uniformity testing Goldreich and Ron [2011]. See Section A.7 for details.
Our last result in this section addresses entropy estimation. We show that when plugged into entropy, the estimator in Hao and Orlitsky [2019c] approximates the property as well as the plug-in estimator whose distribution component is the best natural, for every distribution.
Recall that a distribution estimator is natural if it assigns the same probability to symbols of equal multiplicity, and a property estimator is plug-in if it first finds an estimate of the distribution and then evaluates the property at this estimate. As an off-the-shelf method, the plug-in approach is widely used in estimating distribution properties.
If further the property is symmetric, then it suffices to obtain an accurate estimate of the probability multiset, which is intuitively more statistically efficient than recovering the actual distribution. For example, Hao and Orlitsky [2019a] recently show that for any symmetric property that is additively separable and appropriately Lipschitz, the PML multiset estimator Orlitsky et al. [2004] achieves the optimal sample complexity up to small constant factors.
However, the analysis and computation (though efficient) of such multiset-based estimation methods are often involved Valiant and Valiant [2011, 2013, 2016], Han et al. [2018], Charikar et al. [2019b], Hao and Orlitsky [2019a]. For this reason, distribution-based plug-in estimators are still popular in practice, and often, the distribution components are natural.
As an example, for entropy estimation, several widely used distribution-based estimators are natural plug-in, such as the empirical estimator plugging in the empirical distribution, James-Stein shrinkage Hausser and Strimmer [2009] that shrinks the distribution estimate towards uniform, and Dirichlet-smoothed Schürmann and Grassberger [1996] that imposes a Dirichlet prior over .
The logic behind these estimators is simple: if two distributions are close, then the same is expected to hold for their entropy values. The next theorm shows that for every distribution and among all plug-in entropy estimators, the distribution estimator in Hao and Orlitsky [2019c] is as good as the one that performs best in estimating the actual distribution.
Denote by N the collection of all natural estimators. Write as
for compactness and the KL-divergence between
as
.
Theorem 4 (Competitive entropy estimation). For any distribution p, sample with profile
, and
, we have
with probability at least .
2.3 Competitive Distribution Estimation
Prior works Competitive estimation calls for an estimator that competes with the instance-by-instance performance of a genie knowing more information, but reasonably restricted. Denote by the KL divergence. Introduced in Orlitsky and Suresh [2015], the formulation considers the collection N of all natural estimators, and shows that a simple variant
of the Good-Turing estimator achieves
for every distribution p and with high probability. We refer to the left-hand side as the excess loss of estimator with respect to the best natural estimator, and note that it vanishes at a rate independent of p. For a more involved estimator in Acharya et al. [2013], the excess loss vanishes at a faster rate of
, optimal up to logarithmic factors for every estimator and the respective worst-case distribution. For the
distance, Valiant and Valiant [2016] derive a similar result.
These estimators track the loss of the best natural estimator for each distribution. Yet an equally important component, the excess loss bound, is still of the worst-case nature. For a fully adaptive guarantee, Hao and Orlitsky [2019c] design an estimator that achieves a
excess loss, i.e.,
for every p and , with high probability. Utilizing the adaptiveness of
to the simplicity of distributions, the paper derives excess-loss bounds for several important distribution families, and proves the estimator’s optimality under various of classical and modern learning frameworks.
New results While the work of Hao and Orlitsky [2019c] provides an appealing upper bound on the excess loss, it is not exactly clear how good this bound is as a matching lower bound is missing. In this work, we complete the picture by showing that the bound is essential for competitive estimation and optimal up to logarithmic factors of n.
Theorem 5 (Minimal excess loss). For any and distribution estimator
, there is a distribution p such that with probability at least 9/10, we have both
By Theorem 1, we can replace by
in both the upper and lower bounds.
2.4 Optimal Profile Compression
While a labeled sample contains all information, for many modern applications, such as property estimation and differential privacy, it is sufficient Orlitsky et al. [2004] or even necessary to provide only the profile Suresh [2019]. Hence, this section focuses on the lossless compression of profiles.
For any distribution p, it is well-known that the minimal expected codeword length (MECL) for losslessly compressing a sample is approximately nH(p), which increases linearly in n as long as H(p) is bounded away from zero.
On the other hand, by the Hardy-Ramanujan formula Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies
Consequently, the MECL for losslessly compressing the sample profile is at most
, a number potentially much smaller than nH(p).
By Shannon’s source coding theorem, the profile entropy is the information-theoretic limit of MECL for the lossless compression of profile
. Below, we present explicit block and sequential profile compression schemes achieving this entropy limit, up to logarithmic factors of n.
Block compression The block compression algorithm is intuitive and easy to implement.
Recall that the profile of a sequence is the multiset
of multiplicities associated with symbols in
. The ordering of elements in a multiset is not informative. Hence equivalently, we can compress
into the set
of corresponding multiplicity-prevalence pairs, i.e.,
The number of pairs in is equal to the profile dimension
. In addition, both a prevalence and its multiplicity are integers in [0, n], and storing the pair takes 2 log n nats. Hence, it takes at most
nats to store the compressed profile. By Theorem 1, for any distribution
and
,
Sequential compression For any sequence , the setting for sequential profile compression is that at time step
, the compression algorithm knows only
and sequentially encodes the new information. This is equivalent to providing the algorithm
at time step t.
Suppress in the expressions for the ease of illustration. For efficient compression, we sequentially encode the profile
into a self-balancing binary search tree T , with each node storing a multiplicity-prevalence pair
and
being the search key. We present the algorithm details as follows.
The algorithm runs for exactly n iterations, with a O(log n) per-iteration time complexity. For an i.i.d. sample , the expected space complexity is again
.
2.5 Attributes of Profile Entropy and Dimension
To further our understanding of profile entropy and dimension, we investigate the analytical and statistical attributes of these characteristics concerning their concentration, computation and approximation, monotonicity, and Lipschitzness.
Concentration
Recall that the multiplicity denotes the number of times symbol y appearing in
. Denote by
the logical OR operator. For any distribution p and
, we have
The statistical dependency landscape of terms in the summation is rather complex, since and
are dependent for every (x, y) pair due to the fixed sample size; and so are
and
for every pair of distinct
and
. To simplify the derivations, we relate this quantity to its variant under the aforementioned Poisson sampling scheme, i.e., making the sample size an independent
. Specifically, define
Note that this is not the same as since the summation index goes up only to n. Denote the expected value of
by
. Our result shows that the original
satisfies a Chernoff-Hoeffding type bound centered at
.
Theorem 6. Under the above conditions and for any , and
,
and for any ,
As a corollary, the value of is often close to
. Corollary 1. Under the same conditions as above and for any
and distribution
, with probability at least
,
In addition, we establish an Efron-Stein type inequality.
Theorem 7. For any distribution p and ,
Computation and Approximation
The above results show that is often close to
, with an exponentially small deviation probability. Hence, to approximate
, it suffices to accurately compute
, the expectation of its Poissonized version
. By independence and the linearity of expectations,
The expression is exact but does not relate to p in a simple manner. For an intuitive approximation, we partition the unit interval into a sequence of ranges,
denote by the number of probabilities in
, and relate
to a shape-reflecting quantity
the sum of the effective number of probabilities lying within each range Hao and Orlitsky [2019c]. To compute , we simply count the number of probabilities in each
. Our main result shows that
well approximates
over the entire
, up to logarithmic factors of n.
Theorem 8. For any and
,
Summary The simple expression shows that characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,
closely approximates
, the value around which
concentrates (Theorem 6). Henceforth, we use
as a proxy for both
and
, and study its attributes and values.
Monotonicity
Among the many attributes that possesses, monotonicity is perhaps most intuitive. One may expect a larger value of
as the sample size n increases, since additional observations reveal more information on the variability of probabilities. Below we confirm this intuition.
Theorem 9. For any and
,
Besides the above result that lowerly bounds with
for
, a more desirable result is to upperly bound
with a function of
.
Such a result will enable us to draw a sample of size , obtain an estimate of
from
, and use it to bound the value of
and thus of
for a much larger sample size n. With such an estimate, we can perform numerous tasks such as predicting the performance of algorithms in Section 2.2 and 2.3 when more observations are available, and the space needed for storing a longer sample profile. The next theorem provides a simple and tight bound on
in terms of
.
Theorem 10. For any and
,
The aforementioned application of this result is closely related to the recent works on learnability estimation Kong and Valiant [2018], Kong et al. [2019].
Lipschitzness
Viewing as a distribution property, we establish its Lipschitzness with respect to a weighted Hamming distance and the
distance. Given two distributions
, the vanilla Hamming distance is denoted by
The distance is suitable for being a statistical distance since there may be many symbols at which the two distributions differ, yet those symbols account for only a negligible total probability and has little effects on many induced statistics. To address this, we propose a weighted Hamming distance
The next result measures the Lipschitzness of under
.
Theorem 11. For any integer n, and distributions p and q, if for some
,
Replacing with
results in a common similarity measure – the
distance. The next theorem is an analog to the above under this classical distance.
Theorem 12. For any integer n, and distributions p and q, if for some
,
where c is a constant in [1/3, 3]. Note that the inequality is significant iff , since the value of
is at most
for all p.
2.6 Profile Entropy for Structured Families
Following the study of attributes of profile entropy, we derive nearly tight bounds for the values of three important structured families, log-concave, power-law, and histogram. These bounds tighten up and significantly improve those in Hao and Orlitsky [2019c], and show the ability of profile entropy in charactering natural shape constraints.
Below, we follow the convention of specifying the structured distributions over X = Z.
Log-Concave Distributions
We say that is log-concave if p has a contiguous support and
for all
.
The log-concave family encompasses a broad range of discrete distributions, such as Poisson, hyperPoisson, Poisson binomial, binomial, negative binomial, geometric, and hyper-geometric, with wide applications to numerous research areas, including statistics Saumard and Wellner [2014], computer science Lovász and Vempala [2007], economics An [1997], algebra, and geometry Stanley [1989].
The next result upperly bounds the profile entropy of log-concave families, and is tight up to logarithmic factors of n.
Theorem 13. For any and distribution
, if p is log-concave and has a variance of
,
This upper bound is uniformly better than the bound in Hao and Orlitsky [2019c].
A similar bound holds for t-mixtures of log-concave distributions. More concretely,
Theorem 14. For any integer n and distribution , if p is a t-mixture of log-concave distributions each has a variance of
, where i = 1, . . . , t,
The introduction about log-concave families covers numerous classical discrete distributions, yet leaves many more continuous ones untouched Bagnoli and Bergstrom [2005]. Below, we present a discretization procedure that preserves distribution shapes such as monotonicity, modality, and log-concavity. Applying this procedure to the Gaussian distribution further shows the optimality of Theorem 13.
Discretization
Let X be a continuous random variable over R with density function f(x). For any , denote by
the closest integer z such that
. The
has a distribution over Z:
We refer to as the discretized version of X.
Shape preservation By definition, one can readily verify that the above transformation preserves several important shape characteristics of distributions, such as monotonicity, modality, and k-modality (possibly yields a smaller k). The following theorem covers log-concavity.
Theorem 15. For any continuous random variable X over R with a log-concave density f, the distribution associated with
is also log-concave.
Moment preservation Denote by p the distribution of for
. Let
and
be the mean and variance of density f, given that they exist. The theorem below shows that the discrete distribution p has, within small additive absolute constants, a mean of
and variance of
.
Theorem 16. Under the aforementioned conditions, the mean of satisfies
and the variance of satisfies
Optimality of Theorem 13
By the above formula, the discretized Gaussian has a distribution in the form of
Consolidating Theorem 15 and 16 shows that is a log-concave distribution with a variance of
. Consequently, Theorem 13 yields the following upper bound:
The optimality of Theorem 13 follows by these inequalities.
Power-Law Distributions
We say that a discrete distribution is a power-law with power
if p has a support of [k] := {1, . . ., k} for some
and
for all
.
Power-law is a ubiquitous structure appearing in many situations of scientific interest, ranging from natural phenomena such as the initial mass function of stars Kroupa [2001], species and genera Humphries et al. [2010], rainfall Machado and Rossow [1993], population dynamics Taylor [1961], and brain surface electric potential Miller et al. [2009], to man-made circumstances such as the word frequencies in a text Baayen [2002], income rankings Dr˘agulescu and Yakovenko [2001], company sizes Axtell [2001], and internet topology Faloutsos et al. [1999].
Unlike log-concave distributions that concentrate around their mean values, power-laws are known to possess “long-tails” and always log-convex. Hence, one may expect the profile entropy of power-law distributions to behave differently from that of log-concave ones. The next theorem justifies this intuition and provides tight upper bounds.
Theorem 18. For a power-law distribution with power
, we have
where
The above upper bound fully characterizes the profile entropy of power-laws and surpasses the basic bound for both
and
. In comparison, Hao and Orlitsky [2019c] yields a
upper bound, which improves over
for only
and is worse than that above for all
.
Histogram Distributions
A distribution is a t-histogram distribution if there is a partition of X into t parts such that p has the same probability value over all symbols in each part.
Besides the long line of research on histograms reviewed in Ioannidis [2003], the importance of histogram distributions rises with the rapid growth of data sizes in numerous engineering and science applications in the modern era.
For example, in scenarios where processing the complete data set is inefficient or even impossible, a standard solution is to partition/cluster the data into groups according to the task specifications and element similarities, and randomly sample from each group to obtain a subset of the data to use. This naturally induces a histogram distribution, with each data point being a symbol in the support.
The work of Hao and Orlitsky [2019c] studies the class of t-histogram distributions and obtains the following upper bound
Our contribution is establishing its optimality. Theorem 19. For any , there exists a t-histogram distribution p such that
Note that uniform distributions correspond to 1-histograms, for which the bounds reduce to .
3.1 Multi-Dimensional Profiles
The notion of profile generalizes to the multi-sequence setting.
Let X be a finite or countably infinite alphabet. For every and tuple
of sequences in
, the multiplicity
of a symbol
is the vector of its frequencies in the tuple of sequences. The profile of
is the multiset
of multiplicities of the observed symbols Acharya et al. [2010], Das [2012], Charikar et al. [2019b], and its dimension is the number
of distinct elements in the multiset. Drawing independent samples from
, the profile entropy is simply the entropy of the joint-sample profile.
Many of the previous results potentially generalize to this multi-dimensional setting. For example, thebound on
in the 1-dimensional case becomes
and
This essentially recovers thebound for d = 1.
3.2 Concluding Remarks
The classical view on the entropy of an i.i.d. sample corresponds to the equation
which provides little insight for statistical applications.
This paper presents a different view by decomposing the nH(p) information into three pieces: the labeling of the profile elements, ordering of them, and profile entropy. With no bias towards any symbols and under the i.i.d. assumption, the profile entropy rises as a fundamental measure unifying the concepts of estimation, inference, and compression.
Appendices orgnization In the appendices, we order the results and proofs according to their logical priority. In other words, the proof of a theorem or lemma mainly relies on preceding results. For the ease of reference, the numbering of the theorems is consistent with that in the main paper.
Consider an arbitrary sequence over a finite or countably infinite alphabet X. The multiplicity
of a symbol
is the frequency of y in
. The prevalence of an integer
is the number
of symbols in
with multiplicity
. The profile of
is the multiset
of multiplicities of the symbols in
, which we describe as a profile of length n.
Let be a collection of distributions over X. We say that a distribution collection
is label-invariant if for any
, the collection P contains all its symbol-permuted versions. A distribution property over a distribution collection
is a functional
that associates with each distribution in P a real value. For a label-invariant
, we say that a property f over P is symmetric if it takes the same value for distributions sharing the same probability multiset.
A.1 Profile Dimension and Its Concentration
A profile is said to have length n if there exists
satisfying
. For any multiset S, we define its dimension as the number D(S) of distinct elements in S. Recall that profiles of sequences are multisets. For notational convenience, we write both
and
for the dimension of profile
.
Viewed as a random variable, the profile of an i.i.d. sample has a distribution depending only on p and n. Hence, we denote by
such a profile and write
. Analogously, we denote by
the profile dimension associated with (p, n), and write
.
Recall that the multiplicity denotes the number of times symbol y appearing in
. Denote by
the logical OR operator. For any distribution p and
, we have
The statistical dependency landscape of terms in the summation is rather complex, since and
are dependent for every (x, y) pair due to the fixed sample size; and so are
and
for every pair of distinct
and
. To simplify the derivations, we relate this quantity to its variant under the aforementioned Poisson sampling scheme, i.e., making the sample size an independent
. Specifically, define
Note that this is not the same as since the summation index goes up only to n. Denote the expected value of
by
. Our result shows that the original
satisfies a Chernoff-Hoeffding type bound centered at
.
Theorem 6. Under the above conditions and for any , and
,
and for any ,
Proof. To simplify our analysis, we first consider an alternative model where the sample size is an independent Poisson random variable N with mean n. A nice attribute of Poisson sampling is that all the multiplicities are independent of each other. Later, we will relate this model to the fixed-sample-size model and establish our claim rigorously.
For simplicity and clarity, we suppress in
and write
instead of
when the multiplicity is obtained through Poisson sampling. For any
, denote
. Instead of analyzing
, we consider
Note that for any disjoint , the functions
and
are discordant monotone by each argument, namely, when we increase the value of each
, the increase in the value of one function implies the non-increase of the other. Then, by the results in Lehmann [1966], the values of the two functions, when viewed as random variables, are negatively associated.
Next we show that quantity satisfies a Chernoff-type bound.
Let be an arbitrary positive number. Note that
is a Bernoulli random variable with parameter
Then for the expected value of , we have
For simplicity, write and
. By Markov’s inequality and the monotonicity of function
over t > 0,
It suffices to bound by a function of other parameters.
where (a) follows by the definition of Y ; (b) follows by follows by the fact that
is negatively associated with
follows by an induction argument via negative association; (e) follows by the fact that
is a Bernoulli random variable with mean
follows by the inequality
follows by
; and (h) follows by
. Applying standard simplifications, we obtain
The proof will be complete upon noting that: 1) the probability that N = n is at least ;
2) conditioning on N = n transforms the sampling model to that with a fixed sample size n.
As a corollary, the value of is often close to
.
Corollary 2. Under the same conditions as above and for any , with probability
Proof. To establish the lower bound, note that if , setting
in Theorem 6 yields
else if , setting
yields
As for the upper bound, if ,
and for any ,
Combining these tail bounds through the union bound completes the proof.
In addition to the above, we establish an Efron-Stein type inequality.
Theorem 7. For any distribution p and ,
Proof. First, note that for any and
,
Therefore, the variance of the profile dimension satisfies
A.2 Profile Entropy and Its Connection to Dimension
For a distribution and sampling parameter n, the profile entropy with respect to (p, n) is the entropy
of the sample profile
. By Shannon’s source coding theorem, profile entropy
is the information-theoretic limit of the minimal expected codeword length (MECL) for the lossless compression of the sample profile. Hence, characterizing its value is of fundamental importance. But as one may expect, the distribution of
is sophisticated and over a large alphabet.
More concretely, by the formula of Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies the equation
Despite the complex statistical dependency landscape and the exponentially large alphabet size, below we establish that for any distribution and sample size, the profile entropy is often of the same order as the profile size, with high probability. Specifically,
Theorem 1. For any distribution and
, with probability at least
,
We decompose the proof of the theorem into three steps. First, we show that with high probability, which is a simple consequence of Shannon’s source coding theorem and Theorem 6 (which shows that
highly concentrates around its expectation). Then, we introduce a simple quantity
that approximates the expectation of
to within logarithmic factors of n. Finally, leveraging this approximation guarantee, we establish the other direction of the theorem. This step is more involved due to the aforementioned complications.
A. Bounding Profile Entropy by Its Dimension
By the tail bounds (Theorem 6) and trivial lower bound of 1 on the profile dimension, with probability at least , the expectation of
satisfies
By the block profile compression algorithm presented in Section 2.4 of the main paper, storing profile losslessly takes
nats space in expectation. By Shannon’s source coding theorem, the expected space to losslessly storing a random variable is at least its entropy. Hence, with probability at least ,
Again, noting that completes the proof.
B. Simple Approximation Formula for Profile Dimension
It remains to show that , with high probability. To proceed further, we note that
is often close to
, the expectation of its Poissonized version
, with an exponentially small deviation probability. Hence, to approximate
, it suffices to accurately compute
. By independence and the linearity of expectations,
The expression is exact but does not relate to p in a simple manner. For an intuitive approximation, we partition the unit interval into a sequence of ranges,
denote by the number of probabilities
belonging to
, and relate
to an induced shape-reflecting quantity,
the sum of the effective number of probabilities lying within each range Hao and Orlitsky [2019c]. To compute , we simply count the number of probabilities in each
. Our main result shows that
well approximates
over the entire
, up to logarithmic factors of n.
Theorem 8. For any and
,
Proof. The fact that upperly bounds
simply follows by the concentration of Pois- son variables, and is established in Hao and Orlitsky [2019c]. Below we show that the quantity also serves as a lower bound. By construction, for any given sampling parameter n, index j, and symbol x with probability
, the corresponding symbol multiplicity
. Hence, we can express the expectation of
as
This proves the aforementioned formula. Then, for every sufficiently large index j and , define a sequence of intervals,
Now we analyze the contribution of indices to the expected value of
. For clarity, we divide our analysis into two cases:
and
.
Consider the collection of probabilities
, and the collection
of intervals
. By construction, each probability in
is contained in at least
many intervals in
. Hence the total number of probabilities (repeatedly counted) included in
is at least
. Note that the number of intervals in
is less than 2j log n. We claim that there exists one (or more) interval
containing at least
probabilities. By construction, there are at least
neighboring intervals of
that contain at least
probabilities. The contribution of these these intervals to the expected value of
is at least
times
where the last step holds if . This yields a lower bound of
.
It remains to consider the case. Again, the total number of probabilities included in
is at least
. Furthermore, each interval
contains at most
probabilities and there are less than 2j log n intervals. Therefore, the number of intervals that contain at least
probabilities is at least
. Otherwise, the number of probabilities included in
is less than
which leads to a contradiction. Analogously, the contribution of these these intervals to the expected value of is at least
times
which yields a lower bound of on the expected value of
.
Consolidating the previous results shows that
C. Bounding Profile Dimension by Its Entropy
Next, we establish that for any distribution , with probability at least
,
Let p be an arbitrary distribution in . Recall that we partition the interval (0, 1] into a sequence of sub-intervals,
and denote by the number of probabilities
in
.
Our current objective is to bound from below by a nontrivial multiple of
. For simplicity of derivations, we will adopt the standard Poisson sampling scheme and make the sample size an independent Poisson variable
. For notational simplicity, we will suppress
in all the expressions and write the profile as
by slightly abusing the notation.
Note that the profile can be equivalently expressed as a length-n vector
where denotes the number of symbols appearing exactly i times.
For a sufficiently large absolute constant c, decompose into c parts according to
such that the t-th part (t = 1, . . . , c) consists of
’s satisfying
with
. Since by definition,
one of the c parts corresponds to a partial sum of at least . Without loss of generality, we assume that it is the second part, i.e.,
Apply standard Poisson tail probability bounds, e.g., Lemma 2. Let Y be a Poisson or binomial random variable with mean value . Then,
and
For any and with probability at least
, one can express the truncated profile
over
as a function of
for x satisfying
.
Basically, this says that for every x, the number of its appearance is not too far away from the expected value. By the union bound, this is true for all with probability at least
, as j can take only n possible values. Denote the last event by A.
To proceed, we recall the formula of Hardy and Ramanujan [1918] on the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles:
Below, we will use a weaker version that works for any n:
Then, conditioning on A, the truncated profiles for
are independent. Since conditioning reduces entropy,
where the third last step follows by
the second last follows by for any X with a support size of k, and the fact that there are at most
many profiles of length m, as we explained above; and the last step follows by the elementary inequality
Our new objective is to bound from below. We will find a sub-interval
of
and bound
in the rest of the section, since
For all , our lower bound is simply
Henceforth, we assume that j is sufficiently large and denote .
For any j and every integer , define a sequence of intervals,
On the other hand, the following upper bound holds.
In other words, for any that differ by at most
,
Partition into sub-intervals of equal length
. The partition has a size of at most
. Assign each probability
a length-
interval
centered at
. Then, each interval
covers at least one of the sub-intervals in the partition. Since there are exactly
intervals
, one can find a partition sub-interval
contained in at least
of them. Denote by
the collection of symbols corresponding to these intervals.
Next, we bound from below the entropy of the truncated profile over
. Denote by
the left end point of
. By the chain rule of entropy for multiple random variables,
Consider a particular term on the right-hand side with . By the conditional independence and fact that conditioning reduces entropy,
To characterize the condition, we define a random variable
Note that , which is at most 1/10 for
. The following lemma transforms this into a high-probability statement.
Lemma 3. Let be independent indicator random variables. Let
denote their sum and
denote the expected sum. Then for c > 0, we have
Below we consider only . Note that c/(2 + 2c/3) is increasing for c > 0.
where we set c = 4 in the above lemma and assume that (assuming only
, the upper bound becomes 3/4). Recall that
Denote by the collection of
satisfying
. The above derivation shows that
By independence, for any , we have
For any with
, the corresponding indicator variable satisfies
On the other hand, for any ,
Therefore, the corresponding indicator variable satisfies
To summarize, we have shown that is the sum of |X| independent Bernoulli random variables. Among these Bernoulli variables, at least
have a bias of
, while others have a bias of at most
.
The following lemma, recently established by Hillion et al. [2019], shows the relation among the entropy values of sums of independent Bernoulli random variables with different bias parameters.
Lemma 4. Let be independent indicator random variables. Denote by X and Y the sums of
’s and
’s, respectively. If
,
This lemma, together with the previous results, shows that
Proof. By definition, the left-hand side satisfies
By Stirling’s formula, for any ,
Substituting the right-hand side into the above equation yields
Let g(x) := 0 for and g(x) := (x + 1/2) log x for
. Simple calculus shows that the function is concave. Applying the concavity of g to the last sum yields
where the last step follows by the fact that . A similar inequality holds for the weighted sum of
. Consolidating these inequalities, we obtain
On the other hand, for the log m! term,
Substituting the previous term bounds into the H(bin(m, q)) expression yields
Before continuing, we remark that the bound in the above lemma has the right dependence on in the sense that if we fix q and increase m, the lower bound converges to
q))). Another point to mention is that the above bound covers
, while Lemma 6 appearing later in this section covers
. Note that the dependence on
changes from logarithmic to linear, showing an “elbow effect” around 1/m.
Assume that , then for any
,
Consolidating this with the previous results yields that
where we utilize and
for
. We can then bound the quantity of interest as follows.
On the other hand, if , we can further “compress” the truncated profile
over
to reduce the effective value of
. Specifically, for any integer
, we define the t-compressed version of
as
Note that for each t, the length of is
. For each entry in the compressed version, we can again express the entry as the sum of independent indicator random variables. Specifically,
Furthermore, for any , the expectation of each indicator variable satisfies
Similarly, for any , we have
.
Now, choose t large enough so that . Following the reasoning in the previous case shows that
It remains to consider the case of , for which we adopt our previous analysis.
Again, partition into sub-intervals of equal length
. Then, assign each probability
a length-
interval
centered at
. By construction, each interval
covers at least one of the sub-intervals in the partition. Redefine any of these covered sub-intervals as
. Denote by
the collection of symbols corresponding to the covering intervals.
Note that . For any
, the previous analysis shows that
Proof. By symmetry, we need to consider only the case of .
Consolidating the lemma and the chain rule of entropy yields,
Alternatively, we can use the fact that adding independent random variables does not decrease entropy, i.e., for any independent variables Y and Z. Note that
Let y be an arbitrary symbol that belongs to . Then,
By the previous derivations, both and
belong to
. Hence,
Note that this argument does not apply to other cases, since
while can be as large as
in general.
The proof is complete upon noting that indices with j = O(1) corresponds to a total contribution of at most O(1) to and
, with probability at least
.
Summary The simple expression shows that characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,
closely approximates
, the value around which
concentrates (Theorem 6) and
lies (Thoerem 1). Henceforth, we use
as a proxy for both
and
, and study its attributes and values.
A.3 Symmetric Property Estimation and Sufficiency of Profiles
The rest of Section A shows that the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes. For clarity, we divide the full proof into three parts: a) the sufficiency of profiles for estimating symmetric properties; b) the standard “median trick” often used to boost the confidence of learning algorithms; c) the PML method and its competitiveness to the min-max estimators. The proof utilizes several previously established results.
Sufficiency of profiles We first show that profile-based estimators are sufficient for estimating symmetric properties. Recall that a distribution collection is label-invariant if for any
, the collection P contains all its symbol-permuted versions. Then,
Theorem 2. Let f be a symmetric functional over a label-invariant distribution collection . For any accuracy
and tolerance
, if there exists an estimator
such that
there is an estimator over
satisfying
Note that both estimators can have independent randomness.
Proof. First we show that given estimator , there is an estimator
which is symmetric, i.e., invariant with respect to domain-symbol permutations, and achieves the same guarantee. To see this, consider a random permutation
chosen uniformly randomly from the collection of permutations over the underlying alphabet. Let
. Then for any
,
where (a) follows by the definition of follows by the law of total probability; (c) follows by the independence between
and
follows by the symmetry of f and the equivalence of applying
to
and to p; (e) follows by the fact that
and the guarantee satisfied by the estimator
; and (f) follows by the law of total probability.
Before we proceed further, we introduce the following definitions. For any sequence , the sketch of a symbol x in
is the set of indices
for which
. The type of a sequence
is the set
of sketches of symbols appearing in
.
Since is symmetric, there exists a mapping
over types satisfying
. Due to the i.i.d. assumption on the sample generation process, given the profile of a sample sequence, all the different types corresponding to this profile are equally likely. Let
be a mapping that recovers this relation, i.e.,
maps each profile uniformly randomly to a type having this profile.
Then, for any and
,
Consequently, the mapping is a profile-based estimator that satisfies
A.4 Median Trick
The following argument is standard and often used to boost the confidence of learning algorithms.
Lemma 7 (Median trick). Let be real parameters satisfying
. For an accuracy
and a distribution set
, if there exists an estimator
such that
we can construct another estimator that takes a sample of size
and achieves
Proof. Given i.i.d. copies of
, the probability that less than half of them satisfy the inequality in the parentheses is at least
By the law of total probability, the right-hand side equals to
where the first step follows by the Chernoff bound of binomial random variables, and the second step follows by and the inequality
.
Set , the right-hand side is at least
.
Therefore, given a sample of size , we can partition it into t sub-samples of equal size, apply the estimator
to each subsample, and define the median of the corresponding estimates as
.
By the previous reasoning, this estimator satisfies
A.5 Profile Maximum Likelihood and Its Adaptiveness
For every profile of length n and distribution collection
, the profile maximum likelihood (PML) estimator Orlitsky et al. [2004] over P maps
to a distribution
that maximizes the probability of observing the profile .
For any property f, let denote the smallest error that can be achieved by any estimator with a sample size n and tolerance
on the error probability. This definition is equivalent to that of the sample complexity. In the following, we show that the PML estimator is adaptive to the simplicity of underlying distributions in inferring all symmetric properties, over any label-invariant P.
For brevity, set and suppress both
and P in
.
Theorem 3 (Adaptiveness of PML). Let f be a symmetric property and be a label-invariant distribution collection. For any
and
, with probability at least
,
Proof. For any tolerance and distribution
, define the
-typical cardinality of profiles with respect to p as the smallest cardinality
of a set of length-n profiles such that the probability of observing a sample from p with a profile in this set is at least
. The following lemma provides a tight characterization of
in terms of the dimension of
.
Lemma 8. For any and
, with probability at least
,
The proof of the lemma follows by recursively applying Theorem 6. Specifically, let 3 log n, which is at least
, with probability at least
. Then,
where the last inequality holds with with probability at least .
Now let f be a symmetric functional over P. According to Theorem 2, for any parameters and
, if there exists an estimator
such that
there is an estimator over
satisfying
For an arbitrary length-n profile that satisfies
, these error bounds yield
and since by the definition of PML,
By the union bound and triangle inequality,
which basically upperly bounds the probability that
Next we will assume that there exists an estimator satisfying
By Lemma 7, if , we can construct another estimator
that takes a sample of size
is assumed to be an integer here) and achieves a higher-confidence guarantee
Then by the above reasoning, with probability at least ,
For the first term on the right hand side to vanish as quickly as , it suffices to have
Simplify the expressions and apply the union bound. It suffices to have both
If n satisfies these conditions, with probability at least ,
Summary We have shown the following result, which is a strengthened version of Theorem 3. The result shows that the PML plug-in estimator possesses the amazing ability of adapting to the simplicity of data distributions in inferring all symmetric properties, over any label-invariant classes.
If there exists an estimator such that
for any and
where the sample size n satisfies both
with probability at least ,
Alternative statement
Fix P and assume that . Recall that
denotes the smallest error that can be achieved by the best estimator using a size-n sample with a
-confidence guarantee. Below we provide an alternative statement of the above result, which is more compact in its form.
Then, draw a profile . With probability at least
,
Consolidating this with Theorem 1 yields that
Setting and suppressing it in expressions, we establish the desired guarantee: For any
and
, with probability at least
,
Below are some comments in order.
1. The theorem holds for any symmetric properties, while nearly all previous works require the property to possess certain forms and be smooth.
2. The theorem trivially implies a weaker result in Acharya et al. [2017] where is replaced by
, an upper bound due to the formula of Hardy and Ramanujan [1918].
3. There is a polynomial-time approximation algorithm Charikar et al. [2019a] achieving the same guarantee as that stated in the theorem.
A.6 PML and Sorted Distribution Estimation
The arguments below basically follow by Hao and Orlitsky [2019a] and Hao and Orlitsky [2019d], in which we assume that |X| = O(n log n).
Let f be a function in , the collection of Lipschitz functions over [0, 1]. Without loss of generality, we also assume that f(0) = 0. Let
be a threshold parameter to be determined later. An
-truncation of f is a function
One can easily verify that . Next, we find a finite subset of
so that the
-truncation of any
is close to at least one of the functions in this subset.
For an integer parameter s > 3 to be chosen later. Partition the interval into s disjoint sub-intervals of equal length, and define the sequence of end points as
where
. Then, for each
, we find the integer
such that
is minimized and denote it by
. Since
is 1-Lipschitz, we must have
. Finally, we connect the points
sequentially. This curve is continuous and corresponds to a particular
-truncation
, which we refer to as the discretized
-truncation of f. Intuitively, we have constructed an
grid and “discretized” function f by finding its closest approximation
Therefore, for any , the corresponding properties of
and
satisfy
where we slightly abuse the notation and write . Note that
for all
, and
for
. While there are infinitely many
-truncations, the cardinality of the discretized
-truncations of functions in
is at most
Now we consider the task of estimating from
. By construction, the real function
is a constant for
. In addition, the function is Lipschitz and hence for an absolute constant C and an arbitrary interval
, one can construct an explicit polynomial g(z) of degree at most
, satisfying
Combining these facts with Theorem 5 in Hao and Orlitsky [2019d] shows that there exists an estimator that for all
and
satisfying
and
,
where is an absolute constant bounded away from 0 whose value can be arbitrarily small.
Then, by setting , where the asymptotic notation hides a sufficiently small absolute constant, the right-hand side is at most
. Then, Theorem 2 in this paper and Theorem 3 in Acharya et al. [2017] imply that the PML distribution
satisfies
Consider any and
with a profile
. Consolidate the previous results, and apply the union bound and triangle inequality. With probability at least
, the PML plug-in estimator will satisfy
for all functions f in .
Next we consider the “second part” of a function , namely,
Again, we can verify that . To establish the corresponding guarantees, we make use of the following result. Since the profile probability is invariant to symbol permutation, for our purpose, we can assume that
iff
, for all
. Under this assumption, the following lemma relates
to p.
Lemma 9. For any , distribution
and sample
with profile
,
The proof of this lemma follows from 1) the fact that empirical distribution satisfies such a guarantee with the probability bound being , where we ignore labelings and sort the empirical probabilities according to p; 2) the Cauchy-Schwarz inequality applied to bound the expected error of the empirical estimator and McDiarmid’s inequality used to bound the error probability; 3) a variant of Theorem 3 in Acharya et al. [2017] that addresses distribution estimation Das [2012].
Hence, with probability at least ,
for all functions f in .
Consolidate the previous results. By the triangle inequality and the union bound, with probability at least ,
for all functions f in . Now we can conclude that
is also at most the error bound on the right-hand side. The reason is straightforward: Since with high probability, the above guarantee holds for all functions in
, it must also hold for the function that achieves the supremum in
It remains to balance the error bounds on the estimation and deviation probability. Recall that we assume since otherwise the theorem is trivial to prove. Set
such that
. Then, the confidence lower bound becomes
, and the deviation bound reduces to
. The previous derivations also require that
and
. Setting
yields the desired result.
A.7 PML and Uniformity Testing
For a finite domain X, denote by the uniform distribution over X. Given an error parameter
and a sample
from an unknown distribution
, uniformity testing Goldreich and Ron [2011] aims to distinguish between the null hypothesis
and the alternative hypothesis
In the work of Hao and Orlitsky [2019a], it is shown that the following simple PML-based algorithm achieves the optimal sample complexity Paninski [2008] for uniformity testing, up to logarithmic factors of the alphabet size |X|. Note that we instantiate the distribution collection P as
, and use 0 and 1 to indicate whether
or
is accepted.
Figure 1: Uniformity tester
In this section, we present another intriguing connection between the PML estimator and the uniformity testing problem. For any profile of length n, denote
Then for any accuracy , the following uniformity tester Diakonikolas et al. [2016a] is sample optimal up to logarithmic factors.
The following lemma connects the above algorithm to the PML. Lemma 10. Chan et al. [2015] For any profile that corresponds to a non-constant sequence
,
Let be an arbitrary discrete distribution. Recall that in Section A, we partition the unit interval into a sequence of ranges,
denote by the number of probabilities
belonging to
, and relate
to an induced shape-reflecting quantity,
the sum of the effective number of probabilities lying within each range.
The simple expression of shows that it characterizes the variability of ranges the actual probabilities spread over. As Theorem 8 shows,
closely approximates
, the value around which
concentrates (Theorem 6) and
lies (Thoerem 1). In this section, we use
as a proxy for both
and
, and study its attributes and values.
To further our understanding of profile entropy and dimension, we investigate the analytical attributes of concerning monotonicity and Lipschitzness. Then, we present tight upper and lower bounds on the value of
for a variety of distribution families.
B.1 Monotonicity
Among the many attributes that possesses, monotonicity is perhaps most intuitive. One may expect a larger value of
as the sample size n increases, since additional observations reveal more information on the variability of probabilities. Below we confirm this intuition.
Theorem 9. For any and
,
Besides the above result that lowerly bounds with
for
, a more desirable result is to upperly bound
with a function of
. Such a result will enable us to draw a sample of size
, obtain an estimate of
from
, and use it to bound the value of
and thus of
for a much larger sample size n.
With such an estimate, we can perform numerous tasks such as predicting the performance of PML in estimating symmetric properties when more observations are available, and the space needed for storing a longer sample profile. These applications are closely related to the recent works on learnability estimation Kong and Valiant [2018], Kong et al. [2019].
The next theorem provides a simple and tight upper bound on in terms of
. Theorem 10. For any
and
,
Implications Before proceeding to the proof, we first present two simple implications.
In other words, the sequence , is monotonically decreasing and converges to
. As we increase the value of
, which can be viewed as our estimate of
, is getting more and more accurate. For the purpose of adaptive estimation, if
, we can choose the sequence
.
Proof. For clarity, we denote by p(m, j) the value of corresponding to
, and p(n, j) the value of
corresponding to
. Furthermore, denote
, which we assume is an integer. Then by the definition of
,
The lower-bound part basically follows by reversing the above inequalities.
This completes the proof of the theorem.
B.2 Lipschitzness
Viewing as a distribution property, we establish its Lipschitzness with respect to a weighted Hamming distance and the
distance. Given two distributions
, the vanilla Hamming distance is denoted by
The distance is suitable for being a statistical distance since there may be many symbols at which the two distributions differ, yet those symbols account for only a negligible total probability and has little effects on many induced statistics. To address this, we propose a weighted Hamming distance
The next result measures the Lipschitzness of under
.
Theorem 11. For any integer n, and distributions p and q, if for some
,
Proof. Recall that the quantity of interest is
Given the bound of , we denote by
the collection of symbols x at which
. By definition, we have both
and
. Below, we show that these symbols modify the value of
by at most
. By symmetry, the same claim also holds for the distribution q. Combined, these two claims yields the desired result.
First, we consider satisfying
or
. Such a symbol either does not contribute the value of
, or affects only the value of the first term
, which is at most log n. Hence the claim holds for this case.
Next, consider symbols satisfying
for some
and denote the collection of them by
. By the above assumption, we have
. To maximize their impact on
under this constraint, we should set their values to be
for some J to be determined, where each repeats exactly j log n times. Then, the symbols in Z contributes at most
to
, and the above constraint on the total probability mass bounds transforms to
Therefore in this case, the contribution is again , which completes the proof.
Replacing with
results in a common similarity measure – the
distance. The next theorem is an analog to the above under this classical distance. Theorem 12. For any integer n, and distributions p and q, if
for some
,
where c is a constant in [1/3, 3]. Note that the inequality is significant iff , since the value of
is at most
for all p.
By symmetry, it suffices to prove the following lemma. Lemma 11. For any integer n, and distributions p and q, if for some
,
Proof. Consider the task of modifying p by at most and maximizing the increase in
. For each j and each probability
, denote by
the modified value. Depending on the location of
, there are three types of possible modifications as illustrated below.
• For the first type, we still have . This does not change the value of
and hence does not increase
.
• For the second type, we have or
. If
, this will decrease the value of
by 1 and increase the value of
or
by at most one. Hence in this case, the value of
can only decrease. If
, then
. For a particular j, all such modifications can increase the value of
by at most
2j log n, which is twice the value of
. Hence, all such modifications, when combined, increase the value of
by at most
.
• For the third type, we have and
. If i < j, we require a probability massof at least
, where
. If i > j, we require a probability mass of at least
. The number of such modifications that could lead to an increase in the value of
is at most i log n. For each i, let
denote the number of such modifications that will lead to an increase of
. Then, the total increase is
, each
is at most i log n, and the total required probability mass required is at least
.
Let be the optimal solution that maximizes
. Assume that there are two indices i < j satisfying
and
. Then, if we replace
and
by
and
, respectively,
will not change and
will decrease. Hence, we can assume that there exists
satisfying
and
. In addition, assuming
implies that
. Hence, we have
and
Profile Entropy for Structured Families
Following the study of attributes of profile entropy, we derive below nearly tight bounds on the values of three important structured families, log-concave, power-law, and histogram. These bounds tighten up and significantly improve those in Hao and Orlitsky [2019c], and show the ability of profile entropy in charactering natural shape constraints.
For the remaining sections, we follow the convention and specify structured distributions over X = Z.
B.3 Log-Concave Distributions
We say a discrete distribution is log-concave if p has a contiguous support over Z and the inequality
holds for all symbols
.
The log-concave family encompasses a broad range of discrete distributions, such as Poisson, hyperPoisson, Poisson binomial, binomial, negative binomial, geometric, and hyper-geometric, with wide applications to numerous research areas, including statistics Saumard and Wellner [2014], computer science Lovász and Vempala [2007], economics An [1997], algebra, and geometry Stanley [1989].
The next result upperly bounds the profile entropy of log-concave families, and is tight up to logarithmic factors of n. Theorem 13. For any and distribution
, if p is log-concave and has a variance of
,
Proof. The upper bound is established in Hao and Orlitsky [2019c] using some concentration attributes of the log-concave distributions.
For the other component, we can assume that and n is larger than some absolute constant. Then by Diakonikolas et al. [2016b], the maximum probability
of p belongs to
. Hence, the last index J for which
satisfies
Hence, we have
This upper bound is uniformly better than the bound in Hao and Orlitsky [2019c]. Theorem 17 further shows that it is optimal up to logarithmic factors of n.
A similar bound holds for t-mixtures of log-concave distributions. More concretely,
Theorem 14. For any integer n and distribution , if p is a t-mixture of log-concave distributions each has a variance of
, where i = 1, . . . , t,
B.4 Discretization of Continuous Distributions
The introduction about log-concave families covers numerous classical discrete distributions, yet leaves many more continuous ones untouched Bagnoli and Bergstrom [2005]. Below, we present a discretization procedure that preserves distribution shapes such as monotonicity, modality, and log-concavity. Applying this procedure to the Gaussian distribution further shows the optimality of Theorem 13.
Let X be a continuous random variable with density function f(x). For any , denote by
the closest integer z such that
. The distribution of
is over Z and satisfies
We refer to this random variable as the discretized version of X.
Shape Preservation By definition, one can readily verify that the above transformation preserves several important shape characteristics of distributions, such as monotonicity, modality, and k-modality (possibly yields a smaller k). The following theorem covers log-concavity.
Theorem 15. For any continuous random variable X over R with a log-concave density f, the distribution associated with
is also log-concave.
To show this, we need the following basic lemma about concave functions. Lemma 12. For real numbers and
satisfying
, and
,
Following the lemma, for such that
, and any function f that is log-concave,
Proof. By definition p is log-concave if p has a consecutive support and . For
, the first condition is satisfied since X is has a continuous support on R, and p(z) is positive as long as f(x) > 0 for a non-empty sub-interval of
.
Below we show that p satisfies the second condition. Specifically, for any ,
where the inequality follows by the above lemma and its implication.
Moment preservation Denote by p the distribution of for
. Let
and
be the mean and variance of density f, given that they exist. The theorem below shows that distribution p has, within small additive absolute constants, a mean of
and variance of
. Theorem 16. Under the aforementioned conditions, the mean of
satisfies
and the variance of satisfies
Proof. First consider the mean value of for
. We have
Next consider the variance of . Applying the inequality
yields
By the symmetry in the above reasoning, we also have
Consolidating these inequalities shows that
B.5 Optimality of Theorem 13
By the above formula, the discretized Gaussian has a distribution in the form of
Consolidating Theorem 15 and 16 shows that is a log-concave distribution with a variance of
. Consequently, Theorem 13 yields the following upper bound:
The optimality of Theorem 13 follows by these inequalities.
Proof. At it is clear from the context, we will write p instead of . Recall that
where denotes the number of probabilities belonging to
. Considering part of the distribution can only reduce the value of
. Hence, we focus on the symbols in the range
, over which the probability mass function p(z) is monotone.
We will further assume that , since otherwise the right-hand side of the inequality reduces to O(1), and the result follows by the fact that
for all n and p.
In addition, we focus on in the following argument, as the contribution to
from these indices is no more than the total
.
Given these assumptions, we have
where and the interval is well-defined iff
For clarity, we divide our analysis into two cases: and
.
For the first case and , the length
of the above interval, which equals to
up to an additive slack of 2, satisfies
Therefore, we have . Since
ensures
and
is equivalent to
, the lower bound on
transforms into
. Hence in this case,
admits the following bound
In the case, we have
. Repeating the previous reasoning for
, we again obtain
and
.
Therefore,
Finally, note that in the first case, , while in the second,
.
Consolidating these results yields the desired lower bound
B.6 Power-Law Distributions
We say that a discrete distribution is a power-law with power
if p has a support of [k] := {1, . . ., k} for some
and
for all
.
Power-law is a ubiquitous structure appearing in many situations of scientific interest, ranging from natural phenomena such as the initial mass function of stars Kroupa [2001], species and genera Humphries et al. [2010], rainfall Machado and Rossow [1993], population dynamics Taylor [1961], and brain surface electric potential Miller et al. [2009], to man-made circumstances such as the word frequencies in a text Baayen [2002], income rankings Dr˘agulescu and Yakovenko [2001], company sizes Axtell [2001], and internet topology Faloutsos et al. [1999].
Unlike log-concave distributions that concentrate around their mean values, power-laws are known to possess “long-tails” and always log-convex. Hence, one may expect the profile entropy of power-law distributions to behave differently from that of log-concave ones. The next theorem justifies this intuition and provides tight upper bounds.
Theorem 18. For a power-law distribution with power
, we have
where
The above upper bound fully characterizes the profile entropy of power-laws and surpasses the basic bound for both
and
. In comparison, Hao and Orlitsky [2019c] yields a
upper bound, which improves over
only for
and is worse than that above for all
.
Proof. For the ease of exposition, write the probability of symbol i assigned by distribution p as , where
is a normalizing constant (implicitly depends on k) and k can be infinite. Recall that the quantity of interest is
First consider for a sufficiently large j and note that
where . Observe that the length
of interval
, which differs from the value of
by at most 2, is proportional to
, and hence is a decreasing function of j. Furthermore, each term
is basically the minimum between this decreasing function and j log n, an increasing function of j. This naturally calls for determining the value of j at which the two functions are equal. Concretely,
Let J denote the middle quantity on the right-hand side (implicitly depends on and n). We can decompose the summation
into two parts. The first part consists of indices
,
Correspondingly, the second part consists of indices . For these indices j, we have
. Recall that
specifies the range of i satisfying
. Then the second part satisfies
where the first inequality follows by the fact that the intervals are consecutive. Also note that the boundary case j = J is covered in both
and
under different conditions. Depending on the value of the normalizing constant
, the following implications are immediate and apply to all power parameters
. If
,
These bounds can be refined given the knowledge of k. Note that for , the support size k must be finite for the normalizing constant to be well-defined. On the other hand, for
, it is common to assume
, which we adopt here.
Then, for , the above bound derivations yield
where we lower bound by
.
Next, we improve the upper bound for . Note that the normalizing constant admits
Then for , the previous upper bound yields
where we utilize the inequality . Furthermore, one can bound
by k since it is at most the sum of
. Combined, these two results yield
To complete the picture, we consider the case of . Note that
. Hence for
, the above reasoning implies that
Finally, we also analyze the case of with truncation at k. Our analysis mainly relies on the following inequality that provides a tight lower bound on
.
To summarize, for different values of , the quantity
admits
where
Note that unless and
, all the bounds are better than
. In addition, the derivation above already shows that the bounds are tight up to logarithmic factors. Reorganizing the terms yields the desired result: For any power-law distribution p with power
,
As a remark, for , the distribution becomes a uniform distribution with support size k. The above result also covers this case, for which the upper bound simplifies to
.
B.7 Histogram Distributions
A distribution is a t-histogram distribution if there is a partition of X into t parts such that p has the same probability value over all symbols in each part.
Besides the long line of research on histograms reviewed in Ioannidis [2003], the importance of histogram distributions rises with the rapid growth of data sizes in numerous engineering and science applications in the modern era.
For example, in scenarios where processing the complete data set is inefficient or even impossible, a standard solution is to partition/cluster the data into groups according to the task specifications and element similarities, and randomly sample from each group to obtain a subset of the data to use. This naturally induces a histogram distribution, with each data point being a symbol in the support.
The work of Hao and Orlitsky [2019c] studies the class of t-histogram distributions and obtains the following upper bound
Our contribution is establishing its optimality. Theorem 19. For any , there exists a t-histogram distribution p such that
Note that uniform distributions correspond to 1-histograms, for which the bounds reduce to .
Proof. Again, recall that the quantity of interest is
Our construction depends on the value of t as follows. Let denote the length-A constant sequence of value B. If t = 1, then the distribution p has the following form
where is a properly chosen probability in
so that p is well-defined, and the range of support of distribution p is irrelevant for our purpose and hence unspecified. If
, then for some parameter
to be determined, the distribution p has the following form
where the probability values are sorted according to the ordering they appear above, and L is a properly chosen to make the probabilities sum to 1. For the distribution to be well-defined, we require
where the last inequality is valid given that . Let s be the maximum integer satisfying the inequality above. Then, the quantity
admits the lower bound
Finally, if , then the distribution p has the following form
where is a properly chosen to make the probabilities sum to 1. By the reasoning for the last case, the distribution p is well-defined. In addition, the quantity
satisfies
Consolidating these results yields the desired lower bound.
C.1 Competitive Distribution Estimation
Competitive estimation calls for an estimator that competes with the instance-by-instance performance of a genie knowing more information, but reasonably restricted. Denote by the KL divergence. Introduced in Orlitsky and Suresh [2015], the formulation considers the collection N of all natural estimators, and shows that a simple variant
of the Good-Turing estimator achieves
for every distribution p and with high probability. We refer to the left-hand side as the excess loss of estimator with respect to the best natural estimator, and note that it vanishes at a rate independent of p. For a more involved estimator in Acharya et al. [2013], the excess loss vanishes at a faster rate of
, which is optimal up to logarithmic factors for every estimator and the respective worst-case distribution. For the
distance, the work of Valiant and Valiant [2016] derives a similar result.
These estimators track the loss of the best natural estimator for each distribution. Yet an equally important component, the excess loss bound, is still of the worst-case nature. For a fully adaptive guarantee, Hao and Orlitsky [2019c] design an estimator that achieves a
excess loss, i.e.,
for every p and , with high probability. Utilizing the adaptiveness of
to the simplicity of distributions, the paper derives excess-loss bounds for several important distribution families, and proves the estimator’s optimality under various of classical and modern learning frameworks.
New results While the work of Hao and Orlitsky [2019c] provides an appealing upper bound on the excess loss, it is not exactly clear how good this bound is as a matching lower bound is missing. In this work, we complete the picture by showing that the bound is essential for competitive estimation and optimal up to logarithmic factors of n. Theorem 5 (Minimal excess loss). For any
and distribution estimator
, there is a distribution p such that with probability at least 9/10, we have both
and
According to Theorem 1, we can replace by multiples
of the profile entropy in both the upper and lower bounds.
where for the total to be at most n. Let
denote the length-A constant sequence of value B. Let C be the set of distributions in the form of
where the probability values are sorted according to the ordering they appear above, L is a proper variable that makes the probabilities sum to 1, and the range of support of distribution p is irrelevant for our purpose and hence unspecified. Equip a uniform prior over C (equivalently, construct a random distribution). We have several claims in order:
• For any and
, the value of
, the total probability of symbols appearing
times, is
if
and
is chosen; and is
if
and
is chosen. Any estimator
will incur an expected absolute error of
in estimating
given
.
• Note that for any α and x, y > 0, α
.
• Therefore, the expected squared Hellinger distance of any estimator
in estimating
satisfies
C.2 Competitive Entropy Estimation
The next theorm shows that for every distribution and among all plug-in entropy estimators, the distribution estimator in Hao and Orlitsky [2019c] is as good as the one that performs best in estimating the actual distribution.
Denote by N the collection of all natural estimators. Write as
for compactness and the KL-divergence between
as
.
Theorem 4 (Competitive entropy estimation). For any distribution p, sample with profile
, and
, we have
with probability at least .
Proof. Given any natural estimator and a sample , we denote by q the distribution estimate. The entropy of q differs from the true entropy by
Denote by and
the total probability that distributions p and q assign to symbols with multiplicity
. Since q is induced by a natural estimator, we also write
for the probability that q assigns to each symbol with multiplicity
in
. Recall that prevalence
denotes the number of symbols with multiplicity
in
. Therefore,
. Henceforth, whenever it is clear from the context, we suppress
in related expressions. Then, the second term on the right-hand side satisfies
Let be the smallest nonzero probability of q. By the triangle inequality and Pinsker’s inequality,
By definition, . Now we show that if a symbol x has multiplicity
, the estimator
will assign a probability mass of
. In other words,
since
. Indeed, the corresponding KL-divergence values differ by
Then, the above equalities yield that,
Next consider the other estimator , which is also natural. Let
be the profile dimension of
. By the results in Hao and Orlitsky [2019c], estimator
achieves a
excess loss, i.e.,
for every p and , with probability at least
. In addition, by its construction, the minimum probability
is at least
. Therefore, with probability at least
,
Finally, the triangle inequality combines the above results and yields
This together with Theorem 1 completes the proof.
While a labeled sample contains all information, for many modern applications, such as property estimation and differential privacy, it is sufficient Orlitsky et al. [2004] or even necessary to provide only the profile Suresh [2019]. Hence, this section focuses on the lossless compression of profiles.
For any distribution p, it is well-known that the minimal expected codeword length (MECL) for losslessly compressing a sample is approximately nH(p), which increases linearly in n as long as H(p) is bounded away from zero.
On the other hand, by the Hardy-Ramanujan formula Hardy and Ramanujan [1918], the number P(n) of integer partitions of n, which happens to equal to the number of length-n profiles, satisfies
Consequently, the MECL for losslessly compressing the sample profile is at most
, a number potentially much smaller than nH(p).
By Shannon’s source coding theorem, the profile entropy is the information-theoretic limit of MECL for the lossless compression of profile
. Below, we present explicit block and sequential profile compression schemes achieving this entropy limit, up to logarithmic factors of n.
D.1 Block Compression
The block compression algorithm we propose is intuitive and easy to implement.
Recall that the profile of a sequence is the multiset
of multiplicities associated with symbols in
. The ordering of elements in a multiset is not informative. Hence equivalently, we can compress
into the set
of corresponding multiplicity-prevalence pairs, i.e.,
The number of pairs in is equal to the profile dimension
. In addition, both a prevalence and its multiplicity are integers in [0, n], and storing the pair takes 2 log n nats. Hence, it takes at most
nats to store the compressed profile. By Theorem 1, for any distribution
and
,
D.2 Sequential Compression
For any sequence , the setting for sequential profile compression is that at time step
, the compression algorithm knows only
and sequentially encodes the new information. This is equivalent to providing the algorithm
at time step t.
Suppress in the expressions for the ease of illustration. For efficient compression, we sequentially encode the profile
into a self-balancing binary search tree T , with each node storing a multiplicity-prevalence pair
and
being the search key. We present the algorithm details as follows.
The algorithm runs for exactly n iterations, with a O(log n) per-iteration time complexity. For an i.i.d. sample , the expected space complexity is again
.
E.1 Multi-Dimensional Profiles
As we elaborate below, the notion of profile generalizes to the multi-sequence setting.
Let X be a finite or countably infinite alphabet. For every and tuple
of sequences in
, the multiplicity
of a symbol
is the vector of its frequencies in the tuple of sequences. The profile of
is the multiset
of multiplicities of the observed symbols Acharya et al. [2010], Das [2012], Charikar et al. [2019b], and its dimension is the number
of distinct elements in the multiset. Drawing independent samples from
, the profile entropy is simply the entropy of the joint-sample profile.
Many of the previous results potentially generalize to this multi-dimensional setting. For example, thebound on
in the 1-dimensional case becomes Theorem 20. For any
, and
, there exists a positive integer r such that
and
This essentially recovers thebound for d = 1.
Proof. For simplicity, we suppress in
. Let
denote the standard d-dimensional simplex. As each multiplicity corresponds to a vector in
, in the ideal case, the profile that has the maximum dimension D corresponds to the integer vectors in the scaled simplex
, for some properly chosen parameter r. For the minimum value of such a parameter
, we have
and
Consolidating these two inequalities yields the desired result.
E.2 Discrete Multi-Variate Gaussian
Given a mean vector and covariance matrix
with eigenvalues at least 1, the corresponding discrete d-dimensional Gaussian is specified by its probability mass function
where C > 0 is a normalizing constant. Note that definition is slightly different from that induced by the discretization procedure presented in Section B.4. The reason for adopting this definition (which is also standard in literature) is to simplify the subsequent reasoning. Let be the d eigenvalues of
, where
by assumption. In this section, we show that for
,
where and
, and
is a constant that depends only on d, which appears in Lemma 14. Note that the above bound resembles that for univariate log-concave distributions (Theorem 13). This result is not included in the main paper due to the different setting. Lower bound on C First we bound the value of C from below in terms of its eigenvalues and other parameters. By symmetry, we can decompose the matrix
as
where is a diagonal matrix with
, and V is an orthonormal matrix whose i-th column is the eigenvector
associated with
.
Partition the real space into unit cubes whose vertices belong to
. For any two vectors
that belong to the same unit cube, we want to bound the ratio between
and
. Denote
and
, and express a and b as linear combinations of eigenvectors,
The log-ratio between the corresponding probabilities satisfies
Note that since
and
belong to the same unit cube. Hence, we bound the absolute value of the ratio by
Now, consider the hyper-ellipse E induced by
For any , simple algebra shows that
. Hence by the previous discussion, for any unit cube U with vertices in
, there exists a vertex
of U such that for any
,
Note that is equivalent to
. The probability mass over E is at least
Consolidating the lower and upper bounds and multiplying both sides by C yield
where the first implication follows by the lemma below.
Upper bound We proceed to bound .
Below we assume that C < n/ log n, since otherwise , yielding an O(log n) upper bound on
. Then by definition, the last index j such that
satisfies
Denote by J the quantity on the right-hand side. Then,
Furthermore, by a reasoning similar to that above, the collection of points satisfying
contributes at most O(log n) to
. Hence we need to analyze only points x satisfying p(x) > 1/(Cn). Equivalently, points in
Clearly, these points contribute at most to the sum. Noting that
is a discrete hyper-ellipse, we can bound its cardinality via the following lemma Bentkus and Götze [1997].
Lemma 14. Let be a mean vector, and
be a real covariance matrix with nonzero eigenvalues
. For any
and
, the discrete ellipsoid
admits the following inequality on its cardinality,
where is a constant that depends only on d.
For simplicity, write and
. Applying the above lemma to bound
(where t = 2 log n) and combining the result with our lower bound on C yield
where the second step follows by Lemma 13.
Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Shengjun Pan, and Narayana P Santhanam. Clas- sification using pattern probability estimators. In 2010 IEEE International Symposium on Information Theory, pages 1493–1497. IEEE, 2010.
Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In Proceedings of the 24th Annual Conference on Learning Theory, pages 47–68, 2011.
Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Suresh. Competitive classification and closeness testing. In Conference on Learning Theory, pages 22–1, 2012.
Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal prob- ability estimation with applications to prediction and classification. In Conference on Learning Theory, pages 764–796, 2013.
Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.
Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, 95(3), 1997.
Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Luis Flores, Jose Antonio Lozano, Yves Van de Peer, Rosa Blanco, Víctor Robles, Concha Bielza, et al. A review of estimation of distribution algorithms in bioinformatics. BioData mining, 1(1):6, 2008.
Robert L Axtell. Zipf distribution of US firm sizes. science, 293(5536):1818–1820, 2001.
R Harald Baayen. Word frequency distributions, volume 18. Springer Science & Business Media, 2002.
Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic theory, 26(2):445–469, 2005.
Alexander Barvinok. Computing the permanent of (some) complex matrices. Foundations of Computational Mathematics, 16(2):329–342, 2016.
Alexander I Barvinok. Two algorithmic results for the traveling salesman problem. Mathematics of Operations Research, 21(1):65–84, 1996.
Vidmantas Bentkus and Friedrich Götze. On the lattice point problem for ellipsoids. Acta Arithmetica, 80(2):101–125, 1997.
Peter Bühlmann, Petros Drineas, Michael Kane, and Mark van der Laan. Learning structured distri- butions Ilias Diakonikolas. In Handbook of Big Data, pages 283–300. Chapman and Hall/CRC, 2016.
Chun Lam Chan, Winston Fernandes, Navin Kashyap, and Manjunath Krishnapur. Phase transitions for the uniform distribution in the PML problem and its Bethe approximation. arXiv preprint arXiv:1506.00753, 2015.
Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Bethe approximation for structured matrices: an improved approximation for the profile maximum likelihood. In NeurIPS 2019 Workshop on Information Theory and Machine Learning, 2019a.
Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 780–791, 2019b.
Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.
Hirakendu Das. Competitive tests and estimators for properties of distributions. PhD thesis, UC San Diego, 2012.
Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. arXiv preprint arXiv:1611.03579, 2016a.
Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Efficient robust proper learning of log- concave distributions. arXiv preprint arXiv:1606.03077, 2016b.
Adrian Dr˘agulescu and Victor M Yakovenko. Exponential and power-law probability distributions of wealth and income in the United Kingdom and the United States. Physica A: Statistical Mechanics and its Applications, 299(1-2):213–221, 2001.
Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM computer communication review, 29(4):251–262, 1999.
Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68–75. Springer, 2011.
Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.
Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems, pages 10989–11001, 2019a.
Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019b.
Yi Hao and Alon Orlitsky. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pages 2614–2623, 2019c.
Yi Hao and Alon Orlitsky. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems, pages 11104–11114, 2019d.
Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8834–8843, 2018.
Godfrey H Hardy and Srinivasa Ramanujan. Asymptotic formulaæ in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75–115, 1918.
Jean Hausser and Korbinian Strimmer. Entropy inference and the James-Stein estimator, with appli- cation to nonlinear gene association networks. Journal of Machine Learning Research, 10(Jul): 1469–1484, 2009.
Erwan Hillion, Oliver Johnson, et al. A proof of the Shepp–Olkin entropy monotonicity conjecture. Electronic Journal of Probability, 24, 2019.
Nicolas E Humphries, Nuno Queiroz, Jennifer RM Dyer, Nicolas G Pade, Michael K Musyl, Kurt M Schaefer, Daniel W Fuller, Juerg M Brunnschweiler, Thomas K Doyle, Jonathan DR Houghton, et al. Environmental context explains Lévy and Brownian movement patterns of marine predators. Nature, 465(7301):1066–1069, 2010.
Yannis Ioannidis. The history of histograms (abridged). In Proceedings 2003 VLDB Conference, pages 19–30. Elsevier, 2003.
Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Weihao Kong and Gregory Valiant. Estimating learnability in the sublinear data regime. In Advances in Neural Information Processing Systems, pages 5455–5464, 2018.
Weihao Kong, Gregory Valiant, and Emma Brunskill. Sublinear optimal policy value estimation in contextual bandits. arXiv preprint arXiv:1912.06111, 2019.
Pavel Kroupa. On the variation of the initial mass function. Monthly Notices of the Royal Astronomical Society, 322(2):231–246, 2001.
Erich Leo Lehmann. Some concepts of dependence. The Annals of Mathematical Statistics, pages 1137–1153, 1966.
László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algo- rithms. Random Structures & Algorithms, 30(3):307–358, 2007.
LAT Machado and WB Rossow. Structural characteristics and radiative properties of tropical cloud clusters. Monthly Weather Review, 121(12):3234–3260, 1993.
Anne E Magurran. Measuring biological diversity. John Wiley & Sons, 2013.
Kai J Miller, Larry B Sorensen, Jeffrey G Ojemann, and Marcel Den Nijs. Power-law scaling in the brain surface electric potential. PLoS computational biology, 5(12), 2009.
Michael Mitzenmacher. A brief history of generative models for power law and lognormal distribu- tions. Internet mathematics, 1(2):226–251, 2004.
Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is Good- Turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.
Alon Orlitsky, Narayana P Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 426–435. AUAI Press, 2004.
Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014.
Thomas Schürmann and Peter Grassberger. Entropy estimation of symbol sequences. Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414–427, 1996.
Richard P Stanley. Log-concave and unimodal sequences in algebra, combinatorics, and geometry. Ann. New York Acad. Sci, 576(1):500–535, 1989.
Ananda Theertha Suresh. Differentially private anonymized histograms. In Advances in Neural Information Processing Systems, pages 7969–7979, 2019.
Lionel Roy Taylor. Aggregation, variance and the mean. Nature, 189(4766):732–735, 1961.
Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 685–694, 2011.
Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 142–155, 2016.
Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.
Sergio Verdú. Empirical estimation of information measures: A literature guide. Entropy, 21(8): 720, 2019.
Pascal O Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In 2012 IEEE International Symposium on Information Theory Proceedings. IEEE, 2012.
Pascal O Vontobel. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant-Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2014.