b

DiscoverSearch
About
My stuff
Testing Identity of Multidimensional Histograms
2018·arXiv
Abstract
Abstract

We investigate the problem of identity testing for multidimensional histogram distributions. A distribution  p : D → R+, where D ⊆ Rd, is called a k-histogram if there exists a partition of the domain into k axis-aligned rectangles such that p is constant within each such rectangle. Histograms are one of the most fundamental nonparametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with sub-learning sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound.

In more detail, let q be an unknown d-dimensional k-histogram distribution in fixed dimension d, and p be an explicitly given d-dimensional k-histogram. We want to correctly distinguish, with probability at least 2/3, between the case that  p = q versus ∥p − q∥1 ≥ ϵ. We design an algorithm for this hypothesis testing problem with sample complexity  O((√k/ϵ2)2d/2 log2.5d(k/ϵ))that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., succeeds even if q is only promised to be close to a k-histogram. Moreover, for  k = 2Ω(d), we show a sample complexity lower bound of (√k/ϵ2) · Ω(log(k)/d)d−1 when d ≥ 2. That is, for any fixed dimension d, our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the d = 1 case was well-understood, but no algorithm with sub-learning sample complexity was known, even for d = 2. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.

1.1 Background

The task of verifying the identity of a statistical model — known as identity testing or goodness of fit — is one of the most fundamental questions in statistical hypothesis testing [Pea00, NP33]. In the past two decades, this question has been extensively studied by the TCS and information-theory communities in the framework of property testing [RS96, GGR98]: Given sample access to an unknown distribution q over a finite domain [n] := {1, . . . , n}, an explicit distribution p over [n], and a parameter  ϵ > 0, we want to distinguish between the cases that q and p are identical versus  ϵ-far from each other in  ℓ1-norm (statistical distance). Initial work on this problem focused on characterizing the sample size needed to test the identity of an arbitrary distribution of a given support size n. This regime is well-understood: there exists an efficient estimator with sample complexity  O(√n/ϵ2)[VV14, DKN15b, ADK15] that is worst-case optimal up to constant factors.

The aforementioned sample complexity characterizes worst-case instances and drastically better upper bounds may be possible if we have some a priori qualitative information about the unknown distribution. For example, if q is an arbitrary continuous distribution, no identity tester with finite sample complexity exists. On the other hand, if q is known to have some nice structure, the domain size may not be the right complexity measure for the identity testing problem and one might hope that strong positive results can be obtained even for the continuous setting. This discussion motivates the following natural question: To what extent can we exploit the underlying structure to perform the desired statistical estimation task more efficiently?

A natural formalization of the aforementioned question involves assuming that the unknown distribution belongs to (or is close to) a given family of distributions. Let D be a family of distributions over  Rd. Theproblem of identity testing for D is the following: Given sample access to an unknown distribution  q ∈ D,and an explicit distribution  p ∈ D, we want to distinguish between the case that  q = p versus ∥q − p∥1 ≥ ϵ.(Throughout this paper,  ∥p − q∥1denotes the  L1-distance between the distributions p, q.) We note that the sample complexity of this testing problem depends on the complexity of the underlying class D, and it is of fundamental interest to obtain efficient algorithms that are sample optimal for D. A recent body of work in distribution testing has focused on leveraging such a priori structure to obtain significantly improved sample complexities [BKR04, DDS+13, DKN15b, DKN15a, CDKS17a, DP17, DDK18, DKN17].

One approach to solve the identity testing problem for a family D is to learn  q up to L1-distance ϵ/3and then check (without drawing any more samples) whether the hypothesis is  ϵ/3-close to p. Thus, the sample complexity of identity testing for D is bounded from above by the sample complexity of learning (an arbitrary distribution in) D. It is natural to ask whether a better sample size bound could be achieved for the identity testing problem, since this task is, in some sense, less demanding than the task of learning. In this paper, we provide an affirmative answer to this question for the family of multidimensional histogram distributions.

1.2 Our Results: Identity Testing for Multidimensional Histograms

In this work, we investigate the problem of identity testing for multidimensional histogram distributions. A d-dimensional probability distribution with density  p : D → R, where D ⊂ Rd is either [m]d or [0, 1]d, iscalled a k-histogram if there exists a partition of the domain into k axis-aligned rectangles  R1, . . . , Rk suchthat p is constant on  Ri, for all i = 1, . . . , k. We let Hdk(D)denote the set of k-histograms over D. We will use the simplified notation  Hdk when the underlying domain is clear from the context. Histograms constitute one of the most basic nonparametric distribution families and have been extensively studied in statistics and computer science.

Specifically, the problem of learning histogram distributions from samples has been extensively studied in the statistics community and many methods have been proposed [Sco79, FD81, Sco92, LN96, DL04, WN07, Kle09] that unfortunately have a strongly exponential dependence on the dimension. In the database community, histograms [JKM+98, CMN98, TGIK02, GGI+02, GKS06, ILR12, ADH+15] constitute the most common tool for the succinct approximation of large datasets. Succinct multivariate histograms representations are well-motivated in several data analysis applications in databases, where randomness is used to subsample a large dataset [CGHJ12].

In recent years, histogram distributions have attracted renewed interested from the TCS community in the context of learning [DDS12, CDSS13, CDSS14a, CDSS14b, DHS15, ADLS16, ADLS17, DKS16a, DLS18] and testing [ILR12, DDS+13, DKN15b, DKN15a, Can16, CDGR16, DKN17]. The algorithmic difficulty in learning and testing such distributions lies in the fact that the location and size of the rectangle partition is unknown. The majority of the literature has focused on the univariate setting which is by now well-understood. Specifically, it is known that the sample complexity of learning  H1k is Θ(k/ϵ2) (and thissample bound is achievable with computationally efficient algorithms [CDSS14a, CDSS14b, ADLS17]); while the sample complexity of identity testing  H1k is Θ(√k/ϵ2)[DKN15b]. That is, in one dimension, the gap between learning and identity testing as a function of the complexity parameter k is known to be quadratic.

A recent work [DLS18] obtained a sample near-optimal and computationally efficient algorithm for learning multidimensional k-histograms in any fixed dimension. The sample complexity of the [DLS18] algorithm is  O((k/ϵ2) logO(d)(k/ϵ))while the optimal sample complexity of the learning problem (ignoring computational considerations) is �Θ(dk/ϵ2)1. On the other hand, the property testing question in two (or more) dimensions is poorly understood. In particular, prior to this work, no testing algorithm with sub-learning sample complexity was known, even for d = 2 (independent of computational considerations). In this paper, we obtain an identity tester for multidimensional histograms in any fixed dimension with sub-learning sample complexity and establish a nearly-matching sample complexity lower bound (that applies even to the special case of uniformity testing). Our main result is the following:

Theorem 1.1 (Main Result). Let  ϵ > 0 and k ∈ Z+. Let q ∈ Hdk(D)be an unknown k-histogram dis- tribution over  D = [0, 1]d or D = [m]d, where dis fixed, and  p ∈ Hdk(D)be explicitly given. There is an algorithm which draws  m = O((√k/ϵ2)2d/2 log2.5d(dk/ϵ))samples from q, runs in sample-polynomial time, and distinguishes, with probability at least 2/3, between the case that  p = q versus ∥p − q∥1 ≥ ϵ.Moreover, any algorithm for this hypothesis testing problem requires  (√k/ϵ2)Ω(log(k)/d)d−1 samples for k = 2Ω(d), even for uniformity testing.

A few remarks are in order: First, we emphasize that the focus of our work is on the case where the parameter k is much larger than the dimension d. For example, this condition is automatically satisfied when d is bounded from above by a fixed constant. We note that understanding the regime of fixed dimension d is of fundamental importance, as it is the most commonly studied setting in nonparametric inference. Moreover, in several of the classical database and streaming applications of multidimensional histograms (see, e.g., [PI97, GKTD00, BCG01, Mut05] and references therein) the dimension d is relatively small (at most 10), while the number of rectangles is orders of magnitude larger . For such parameter regimes, our identity tester has sub-learning sample complexity that is near-optimal, up to the precise power of the logarithm (as follows from our lower bound). Understanding the parameter regime where k and d are comparable, e.g., k = poly(d), is left as an interesting open problem.

It is important to note that our identity testing algorithm is robust to model misspecification. Specifically, the algorithm is guaranteed to succeed as long as the unknown distribution  q is ϵ/10-close, in L1-norm, to being a k-histogram. This robustness property is important in applications and is conceptually interesting for the following reason: In high-dimensions, robust identity testing with sub-learning sample complexity is provably impossible, even for the simplest high-dimensional distributions, including spherical Gaussians [DKS16b].

A conceptual implication of Theorem 1.1 concerns the sample complexity gap between learning and identity testing for histograms. It was known prior to this work that the gap between the sample complexity of learning and identity testing for univariate k-histograms is quadratic as a function of k. Perhaps surprisingly, our results imply that this gap decreases as the dimension d increases (as long as the dimension remains fixed). This follows from our sample complexity lower bound in Theorem 1.1 and the fact that the sample complexity of learning  Hdk is ˜Θ(dk/ϵ2)(as follows from standard VC-dimension arguments, see, e.g., [DLS18]). In particular, even for d = 3, the gap between the sample complexities of learning and identity testing is already sub-quadratic and continues to decrease as the dimension increases. (We remind the reader that our lower bound applies for  k > 2Ω(d).)

Finally, we note here a qualitative difference between the  d = 1 and d ≥ 2cases. Recall that for d = 1 the sample complexity of identity testing k-histograms is  Θ(√k/ϵ2). For d = 2, the sample complexity of our algorithm is  O((√k/ϵ2) log5(k/ϵ)). It would be tempting to conjecture that the multiplicative logarithmic factor is an artifact of our algorithm and/or its analysis. Our lower bound of  Ω((√k/ϵ2) log(k)) showsthat some constant power of a logarithm is in fact necessary.

1.3 Related Work

The field of distribution property testing [BFR+00] has been extensively investigated in the past couple of decades, see [Rub12, Can15, Gol17]. A large body of the literature has focused on characterizing the sample size needed to test properties of arbitrary discrete distributions. This regime is fairly well understood: for many properties of interest there exist sample-efficient testers [Pan08, CDVV14, VV14, DKN15b, ADK15, CDGR16, DK16, DGPP16, CDS17, Gol17, DGPP17, BC17, DKS18, CDKS17b]. More recently, an emerging body of work has focused on leveraging a priori structure of the underlying distributions to obtain significantly improved sample complexities [BKR04, DDS+13, DKN15b, DKN15a, CDKS17a, DP17, DDK18, DKN17].

The area of distribution inference under structural assumptions — that is, inference about a distribution under the constraint that its probability density function satisfies certain qualitative properties — is a classical topic in statistics starting with the pioneering work of Grenander [Gre56] on monotone distributions. The reader is referred to [BBBB72] for a summary of the early work and to [GJ14] for a recent book on the subject. This topic is well-motivated in its own right, and has seen a recent surge of research activity in the statistics and econometrics communities, due to the ubiquity of structured distributions in the sciences. The conventional wisdom is that, under such structural constraints, the quality of the resulting estimators may dramatically improve, both in terms of sample size and in terms of computational efficiency.

1.4 Basic Notation

We will use p, q to denote the probability density functions (or probability mass functions) of our distributions. If p is discrete over support  [n] def= {1, . . . , n}, we denote by  pithe probability of element i in the distribution. For discrete distributions  p, q, their ℓ1 and ℓ2distances are  ∥p − q∥1 = �ni=1 |pi − qi|and  ∥p − q∥2 = ��ni=1(pi − qi)2. For D ⊆ Rdand density functions  p, q : D → R+, we have∥p − q∥1 =�D |p(x) − q(x)|dx. The total variation distance between distributions p, q is defined to be dTV (p, q) = (1/2) · ∥p − q∥1.

Fix a partition of the domain D into disjoint sets  S := (Si)ℓi=1.For such a partition S, the reduced distribution  pSr corresponding to p and S is the discrete distribution over  [ℓ]that assigns the i-th “point” the mass that p assigns to the set  Si; i.e., for  i ∈ [ℓ], pSr (i) = p(Si).

Our lower bound proofs will use the following metric, which can be seen as a generalization of the chi-square distance: For probability distributions  p, q and r let χp(q, r) def=� dqdrdp .

1.5 Overview of Techniques

In this section, we provide a high-level overview of our algorithmic and lower bounds techniques in tandem with a comparison to prior related work.

Overview of Identity Testing Algorithm We start by describing our uniformity tester for d-dimensional k-histograms. For the rest of this intuitive description, we focus on histograms over  [0, 1]d. A standard, yet important, tool we will use is the concept of a reduced distribution defined above. Note that a random sample from the reduced distribution  pSr can be obtained by taking a random sample from p and returning the element of the partition that contains the sample.

The first observation is that if the unknown distribution  q ∈ Hdk and the uniform distribution p = U are  ϵ-far in L1-distance, there exists a partition of the domain into k rectangles  R1, . . . , Rksuch that the difference between q and p can be detected based on the reduced distributions on this partition. If we knew the partition  R1, . . . , Rkahead of time, the testing problem would be easy: Since the reduced distributions have support k, this would yield a uniformity tester with sample complexity  O(√k/ϵ2). The main difficulty is that the correct partition is unknown to the testing algorithm (as it depends on the unknown histogram distribution q).

A natural approach, employed in [DKN15b] for d = 1, is to appropriately “guess” the correct rectangle partition. For the univariate case, a single interval partition already leads to a non-trivial uniformity tester. Indeed, consider partitioning the domain into  Θ(k/ϵ)intervals of equal length (hence, of equal mass under the uniform distribution). It is not hard to see that the reduced distributions over these intervals can detect the discrepancy between q and p, leading to a uniformity tester with sample complexity  Θ((k/ϵ)1/2/ϵ2) =Θ(k1/2/ϵ5/2). This very simple scheme gives an identity testing with sub-learning sample complexity when ϵis constant — albeit suboptimal for small  ϵ. Unfortunately, such an approach can be seen to inherently fail even for two dimensions: Any obliviously chosen partition in two dimensions requires  Ω(k2/ϵ2)rectangles, which leads to an identity tester with sample complexity  Ω(k/ϵ3). Hence, a more sophisticated approach is required in two dimensions to obtain any improvement over learning.

Instead of using a single oblivious interval decomposition of the domain, the sample-optimal  Θ(k1/2/ϵ2)uniformity tester of [DKN15b] for univariate k-histograms partitions the domain into intervals in several different ways, and runs a known  ℓ2-tester on the reduced distributions (with respect to the intervals in the partition) as a black-box. At a high-level, we appropriately generalize this idea to the multidimensional setting.

To achieve this, we proceed by partitioning the domain into approximately k identical rectangles, distinguishing the different partitions based on the shapes of these rectangles. This requirement to guess the shape is necessary, as for example partitioning the square into rows will not suffice when the true partition is a partition into columns. We show that it suffices to consider a poly-logarithmic sized set of partitions, where any desired shape of rectangle can be achieved to within a factor of 2. In particular, we show that for each of the k rectangles in the true partition that are sufficiently large, at least one of our oblivious partitions will use rectangles of approximately the same size, and thus at least one rectangle in this partition will approximately capture the discrepancy due to this rectangle (note that only considering large rectangles suf-fices, since any rectangle on which the uniform distribution assigns substantially more mass than q must be reasonably large). This means that at least one partition will have an  ϵ/polylog(k/ϵ)discrepancy between p and q, and by running an identity tester on this partition, we can distinguish them.

One complication that arises here is that for small values of  ϵ, the difference between p and q might be due to rectangles with area much less than 1/k. In order to capture these rectangles, we will need some of our oblivious partitions to be into rectangles with area smaller than 1/k, for which there will necessarily be more than k rectangles in the partition (in fact, as many as  k/ϵmany rectangles). This would appear to cause problems for the following reason: the sample complexity of  ℓ1-uniformity testing over a discrete domain of size  n is Θ(n1/2/ϵ2). Hence, naively using such a uniformity tester on the reduced distributions obtained by a decomposition into  k/ϵrectangles would lead to the sub-optimal sample complexity of  Θ((k/ϵ)1/2/ϵ2) = Θ(k1/2/ϵ5/2).

We can circumvent this difficulty by leveraging the following insight: Even though the total number of rectangles in the partition might be large, it can be shown that for a well-chosen oblivious partition, a reasonable fraction of this discrepancy is captured by only k of these rectangles. In such a case, the sample complexity of uniformity testing can be notably reduced using an “ℓk1-identity tester” — an identity tester under a modified metric that measures the discrepancy of the largest k domain elements. By leveraging the flattening method of [DK16], we design such a tester with the optimal sample complexity of  O(√k/ϵ2)(Theorem 2.3) — independent of the domain size. This completes the sketch of our uniformity tester for the multidimensional case.

To generalize our uniformity tester to an identity tester for multidimensional histograms, two significant problems arise. The first is that it is no longer clear what the shape of rectangles in the oblivious partition should be. This is because when the explicit distribution p is not the uniform distribution, equally sized rectangles are not a natural option to consider. This problem can be fixed by breaking the axes into pieces that assign equal mass to the marginals of the known distribution (Lemma 2.8). The more substantial problem is that it is no longer clear that the discrepancy between p and q can be captured by a partition of the square into k rectangles. This is because the two k rectangle partitions corresponding to the k-histograms p and q when refined could lead to a partition of the square into as many  k2 rectangles.

To remedy this, we note that there is still a partition into k rectangles such that q is piecewise constant on that partition. We show (Lemma 2.4) that if we refine this partition slightly — by dividing each region into two regions, the half on which p is heaviest and the half on which p is lightest — this new partition will capture a constant fraction of the difference between p and q. Given this structural result, our identity testing algorithm becomes similar to our uniformity tester. We obliviously partition our domain into rectangles poly-logarithmically many times, each time we now divide each rectangle further into two regions as described above, and then run identity testers on these partitions. We show that if  p and q differ by ϵ inL1-distance, then at least one such partition will detect at least  ϵ/polylog(k/ϵ)of this discrepancy.

Overview of Sample Complexity Lower Bound Note that  Ω(√k/ϵ2)is a straightforward lower bound on the sample complexity of identity testing k-histograms, even for d = 1. This follows from the fact that a k-histogram can simulate any discrete distribution over k elements.

In order to prove lower bounds of the form  ω(√k/ϵ2), we need to show that any tester must consider many possible shapes of rectangles. This suggests a construction where we have a grid of some unknown dimensions, where some squares in the grid are dense and the remainders are sparse in a checkerboard-like pattern. It should be noted that if we have two such grids whose dimensions differ by exactly a factor of 2, it can be arranged such that the distributions are exactly uncorrelated with each other. Using this observation, we can construct log(k) such uncorrelated distributions that the tester will need to check for individually. Unfortunately, this simple construction will not suffice to prove our desired lower bound, as one could merely run log(k) different testers in parallel. (We note, however, that this construction does yield a non-trivial lower bound, see Proposition 3.3.) We will thus need a slightly more elaborate construction, which we now describe: First, we divide the square domain into polylog(k) equal regions. Each of these regions is turned into one of these randomly-sized checkerboards, but where different regions will have different scales. We claim that this ensemble is hard to distinguish from the uniform distribution.

The formal proof of the above sketched lower bound is somewhat technical and involves bounding the

chi-squared distance of taking Poi(m) samples from a random distribution in our ensemble with respect to the distribution obtained by taking Poi(m) samples from the uniform distribution.

Bounding the chi-square distance is simplified by noting that since the sets of samples from each of the √kbins are independent of each other, we can consider each of them independently. For each individual bin, we take  s ∼ Poi(m/√k)samples and need to compute  χU⊗s(X⊗s, Y ⊗s) = χU(X, Y )s, where X andY are random distributions from our ensemble and U is the uniform distribution. It is not hard to see that if X and Y are checkerboards of different scales, then the  χ2-value is exactly 1. This saves us a factor of log(k), as there are log(k) many different scales to consider, and leads to the desired sample lower bound (Theorem 3.4).

Organization In Section 2, we give our identity testing algorithm. Our sample complexity lower bound proof is given in Section 3. Finally, Section 4 outlines some directions for future work.

In Section 2.1, we describe and analyze our identity tester, assuming the existence of a good oblivious covering. In Section 2.2, we show the existence of such a covering.

2.1 Algorithm and its Analysis

Let q be the unknown histogram distribution and p be the explicitly known one. Our algorithm considers several judiciously chosen oblivious decompositions of the domain that will be able to approximate a set on which we can distinguish our distributions. We formalize the properties that we need these decompositions to have with the notion of a good oblivious covering (Definition 2.1 below). The essential idea is that we cover the domain  [0, 1]d with rectangles that do not overlap too much in such a way so that any partition of [0, 1]d into krectangles can be approximated by some union of rectangles in this family.

Definition 2.1 (good oblivious covering). Let p be a probability distribution on  [0, 1]d. For k, j, ℓ ∈ Z+ and0 < ϵ ≤ 1/2, a (k, j, ℓ, ϵ)-oblivious covering of p is a family F of subsets of  [0, 1]d satisfying the following:

image

(d) For each  S ∈ Sthere is some histogram rectangle  R ∈ Π such that Sonly contains points from R, i.e., S ⊆ R.

image

In Section 2.2, Lemma 2.8, we establish the existence of a  (k, 2d logd(4kd/ϵ), logd(4kd/ϵ), ϵ)-obliviouscovering of p for any distribution  p on [0, 1]d and for all  k, d, ϵ with ϵ ≤ 1/2.

Our basic plan will be that if p is a distribution with a  (k, j, ℓ, ϵ/2)-oblivious covering F, and q is a k-histogram that differs from p by at least  ϵ in L1-distance, then q defines a partition  Π of [0, 1]d into krectangles. This partition gives rise to a subfamily  S ⊆ Fsatisfying the constraints specified in Defini-tion 2.1. We would like to show that a constant fraction of the discrepancy between p and q can be detected by considering their restrictions to S. There are a couple of obstacles to showing this, the first of which is that we do not know what S is. Fortunately, we do have the guarantee that |S| is relatively small. We can consider the restrictions of p and q over all sets in S and try to check if there is a significant discrepancy between the two coming from any small subset. To achieve this, we will make essential use of an identity tester under the  ℓ1k-metric, which we now define:

Definition 2.2 (ℓk1-distance). Let p and qbe distributions on a finite size domain, that we denote by [n] without loss od generality. For any positive integer  k ≥ 1, we define ∥p − q∥1,kas the sum of the largest k values of  |p(i) − q(i)| over i ∈ [n].

image

Theorem 2.3 (Sample-Optimal  ℓk1 Identity Testing). Given a known discrete distribution p and sample access to an unknown discrete distribution q, each of any finite domain size, there exists an algorithm that accepts with probability 2/3 if p = q and rejects with probability  2/3 if ∥p − q∥1,k ≥ ϵ.The tester requires only knowledge of the known distribution  p and O(√k/ϵ2)samples from q.

Recall that if we wanted to distinguish between  p = q and ∥p − q∥1 > ϵ, this would require  Ω(√n/ϵ2)samples. However, the optimal  ℓ1-identity testers are essentially adaptations of  ℓ2-testers. That is, roughly speaking, they actually distinguish between  p = q and ∥p − q∥2 > ϵ/√n. Hence, it should be intuitively clear why it would be easier to test for discrepancies in  ℓk1-distance: If  ∥p−q∥1,k > ϵ, then ∥p−q∥2 > ϵ/√k,making it easier for an  ℓ2-type tester to detect the difference. We apply the flattening technique of [DK16] combined with the  ℓ2-tester of [CDVV14] to obtain our optimal  ℓk1-identity tester. We note that an optimal ℓk1 closeness tester between discrete distributions was given in [DKN17]. The proof of Theorem 2.3 follows along the same lines and is given in Appendix A.

The second obstacle is that although q will be constant within each  S ∈ S, it will not necessarily be the case that p(S) and q(S) will differ substantially even if the variation distance between p and q on S is large. To fix this, we show that S can be split into two parts such that at least one of the two parts will necessarily detect a large fraction of this difference:

Lemma 2.4. Let  p, q : Rd → R+ and let Sbe a bounded open subset in  Rd on which qis uniform. Suppose S is partitioned into two subsets  S1, S2 such that vol(S1) = vol(S2) = vol(S)/2 and p(s1) ≥ p(s2) for alls1 ∈ S1, s2 ∈ S2, where vol()denotes Euclidean volume. Then,

image

We will show that

image

By an argument analogous to the one we will give to prove Equation (1), one can also prove that

image

Combining the above will give Lemma 2.4.

image

two on the right hand side. Similarly, if  S1 ⊆ S ∩ W, then it also holds (but this time with the factor of two). To show this, note that

image

The RHS is a sum of two integrals where the second integral’s integrand is always smaller than the smallest value of the first integral’s integrand. Furthermore, the second integral is over a region that is no larger than the first region of the first integral, because  vol(S1) = vol(S)/2, while vol(S2∩W) ≤ vol(S2) = vol(S)/2.Thus, we have

image

which implies Equation (1). The final case needed to prove Equation (1) holds is when  S1 ∩ W ⊊ S1, which is equivalent to saying that  S1contains points  x for which p(x) < q(x). Let h = −�S1∩W ′(p(x) − q(x))dx ≥ 0. Then we have

image

If  h ≤�W (p(x) − q(x))dx/2, then we can substitute this into the preceding equation and we are done. Otherwise,  h >�W (p(x) − q(x))dx/2. Note that in this case,  |�S2(p(x) − q(x))dx| ≥ h. 2Putting these together gives

image

completing the proof.

We can now state the main algorithmic result of this section:

Theorem 2.5. Let p be a known distribution on  [0, 1]d with a (k, j, ℓ, ϵ/2)-oblivious covering. There exists a tester that given sample access to an unknown  k-histogram q on [0, 1]d distinguishes between p = q and dTV (p, q) ≥ ϵwith probability at least  2/3 using O(√kj · ℓ2/ϵ2) samples.

Plugging in the bounds on  j and ℓ of 2d logd(kd/ϵ) and logd(kd/ϵ)from Lemma 2.8 (established in Section 2.2) yields a sample complexity upper bound of  O(√k2d/2 log2.5d(kd/ϵ)/ϵ2) for ϵ ≤ 1/2. Thisgives the upper bound portion of Theorem 1.1.

The high-level idea of the algorithm establishing Theorem 2.5 is to take each element of the oblivious cover and divide it in two, as in Lemma 2.4, and then use the tester from Theorem 2.3 on the induced distributions of p and q on the resulting sets. The algorithm itself is quite simple and is presented in pseudo-code below.

Proof of Theorem 2.5. We note that the sample complexity of the tester described in Algorithm 1 is  O(√kjℓ2/ϵ2),as desired. It remains to prove correctness.

The completeness case is straightforward. If p = q, then clearly  p′ = q′ and our tester will accept with probability at least 2/3.

image

We now proceed to prove soundness. If  dTV (p, q) ≥ ϵ, we claim that our tester will reject with probability at least 2/3. For this we note that the unknown distribution q defines some partition  Π of [0, 1]d into krectangles such that q is constant on each part of the partition. By the definition of an oblivious cover, there is a subfamily of disjoint sets  S ⊆ Fsuch that:

q is constant on each element of S.

• |S| ≤  k · j.

Letting  V = �S∈S S, we have that  p (V ) ≥ 1 − ϵ/2.

Since  ϵ = dTV (p, q) =�[0,1]d max(p−q, 0)dx,we have that�V max(p−q, 0)dx ≥ ϵ−�[0,1]d\V pdx ≥ ϵ/2.Therefore, since the elements of S are disjoint, we have that �S∈S�S |p − q|dx ≥ ϵ/2.We now let A ⊆ F′ be the collection of all  S1 or S2corresponding to an  S ∈ S. We note that |A| = 2|S| ≤ 2k · j. Furthermore, by Lemma 2.4, we have that

image

image

Figure 1: z-grids for different values of  z partition [0, 1]2 into rectangles. In this figure, the axes are scaled such that the marginal distributions of the vertical and horizontal coordinates, respectively, of p are uniform.

Remark 2.6. Algorithm 1 is robust in the sense that it still works even if q is only (say)  ϵ/10-close to some k-histogram distribution  �qinstead of actually being one. To show this, one can note that the existing proof applied to  p and �q gives an A such that �A∈A |p(A) − �q(A)|is at least  ϵ/8. The triangle inequality then implies �A∈A |p(A) − q(A)| ≥ ϵ/40, which, by the same reasoning given in the proof of the non-robust case, implies the algorithm is still correct.

Remark 2.7. Even though our testing algorithm was phrased for histograms over  [0, 1]d, it can be made to apply for discrete histograms on  [m]d via a simple reduction. In particular, if each element of  [m]d isreplaced by a box of side length 1/m on each side, a k-histogram on  [m]d is transformed into a k-histogram over  [0, 1]d, in a way that preserves total variation distance. If our algorithm is applied to the latter histogram, we can obtain correct results for the former.

2.2 Construction of Good Oblivious Covering

In this section, we prove the existence of an oblivious covering:

Lemma 2.8. For any continuous distribution  p on [0, 1]d, positive integer  k and ϵ ≤ 1, there exists a �k, 2d logd(4kd/ϵ), logd(4kd/ϵ), ϵ�-oblivious covering of p.

Proof. The basic idea of our construction will be to let F be a union of grids where the number of cells in each direction is a power of 2.

For each coordinate,  j ∈ [d], and each non-negative integer i, define the  ith partition of this coordinate to be a partition of  [0, 1] into 2i intervals such that  jth marginal, pj, of passigns each interval in the partition equal mass, and such that the  ith partition is a refinement of the  (i − 1)st.

For each vector  z ∈ Nd, define the z-grid as the partition of  [0, 1]d into rectangles by taking the product of the  zthjpartition of the  jth coordinate. We let F be the union of the cells in the z-grid for all  z ∈Nd ∩ [0, m − 1] for m = log2(4kd/ϵ). An illustration is given in Figure 1.

We note that each  x ∈ [0, 1]d is in exactly one cell in each z-grid, and therefore is contained in exactly md elements of F, verifying Property 2.

For Property 1, consider a partition of  [0, 1]d into rectangles  R1, . . . , Rk. We claim that for each  Ri thereis a subfamily  Ti ⊆ Fof disjoint subsets of  Ri with |Ti| ≤ 2dmd, and such that  p�Ri\ �S∈Ti S�≤ ϵ/k. Itis then clear that taking S to be the union of the  Tiwill suffice. In fact, we will show that for any rectangle Ri, there is a corresponding  Tiwith these properties.

We let  Ri = �dj=1 Ijfor intervals  Ij. We let I′j be Ijminus the intervals of the  (m − 1)st-partition of the  jth coordinate that contain the endpoints of  Ij. We note that  pj(Ij\I′j) ≤ ϵ/(kd) and that I′j is a union

image

Figure 2: How our oblivious covering is used to cover a rectangle  Riin the proof of Lemma 2.8. Each dimension of  Riis separately decomposed into non-overlapping one-dimensional rectangles, with a small amount of area shaded in beige left over on the sides.  Tiis obtained by taking the family of all Cartesian products of the form  I′′1 × · · · × I′′d where, for each  j, I′′j is any subinterval in the decomposition of  I′j. Inthis figure, the axes are scaled such that the marginal distributions of the vertical and horizontal coordinates, respectively, of p are uniform.

of consecutive intervals in the  (m − 1)st partition of this coordinate. We claim that this means that  I′j is theunion of at most 2m intervals of one of the first  m − 1partitions of the  jth coordinate. This is easy to see by induction on  m, as I′j is a union of consecutive intervals in the  (m − 2)ndpartition union at most one interval of the  (m − 1)st on either end. The one-dimensional intervals on the top and left of Figure 2 show an illustration of this.

In order to produce  Ti, we write each  I′j as a union of at most 2m intervals from the relevant partitions. We let  Tibe the set of rectangles obtained by taking the product of one rectangle from each of these sets. It is then clear that  Tipartitions �dj=1 I′j into at most  (2m)dpieces. Figure 2 shows an illustration of this. We now note that

image

Thus, T satisfies all of the desired properties, and taking the union of the  Tiwill yield an appropriate S. This completes the proof of Lemma 2.8.

In this section, we prove the sample complexity lower bound of Theorem 1.1. The structure of this section is as follows: We begin (Proposition 3.2) by providing a new proof that  Ω(√k/ϵ2)samples are required to test uniformity of a k-histogram in one dimension. The purpose of reproving this previously known result is so that we may later generalize it to higher dimensions. We then proceed by describing a basic construction that yields a slightly improved lower bound (Proposition 3.3). Finally, we present a more sophisticated construction that suffices to establish our final lower bound in Theorem 3.4.

Basic Background. Recall the definition of the  χ-metric. Notice that, for fixed  q, χp(q, r)is an inner product on distributions q, r. Furthermore, by the Cauchy-Schwarz inequality it follows that if q and p are probability distributions then

image

This metric is useful for determining whether or not distributions can be distinguished. In particular, if q and p can be distinguished from a single sample, it must be the case that  χp(q, q)is much bigger than 1. Formally, we have:

Lemma 3.1. Suppose that q and p are probability distributions. Suppose furthermore that there is an algorithm that given a random sample from q accepts with probability at least 2/3, and given a random sample from p rejects with probability at least 2/3. Then, it holds that  χp(q, q) ≥ 4/3.

Proof. Let A be the set on which the algorithm accepts. We then have that  q(A) ≥ 2/3 and p(A) ≤ 1/3.Therefore, we have that

image

3.1 Lower Bound for Uniformity Testing of Univariate Histograms

We start by using Lemma 3.1 to prove a lower bound on the number of samples required to test uniformity of univariate k-histograms. We build on this argument in the following subsections to establish our final multidimensional lower bound.

The idea is to use a standard adversary argument, using Lemma 3.1 to show that it is impossible to distinguish samples taken from a distribution from a particular ensemble, from those taken from the uniform distribution.

Proposition 3.2. If there exists an algorithm that given s independent samples from an unknown k-histogram, q, on [0, 1] and accepts with at least 2/3 probability if q = U and rejects with at least 2/3 probability if dTV (q, U) ≥ ϵ, then s = Ω(√k/ϵ2).

Proof. We assume that k is even. Divide [0, 1] into k/2 equally sized bins. Let P be a distribution over k histograms where in each bin either  dq = (1 + ϵ)dxon the first half and  dq = (1 − ϵ)dxon the second half of the bin, or visa versa independently for each bin. Note that a sample from P is always a k-histogram q with  dTV (q, U) = ϵ. Let P⊗s be the distribution on  [0, 1]s obtained by randomly picking a distribution q from P and then taking s independent samples from q.

Given that an algorithm to distinguish the uniform distribution from k-histograms far from it exists, such a distribution can distinguish a single sample from  P⊗s from a sample from  U ⊗s. Therefore, by Lemma 3.1, we must have that  χU⊗s(P⊗s, P⊗s) ≥ 4/3.We will now try to bound this quantity.

Note that  P⊗s is a mixture of the distributions  q⊗s where qis drawn from P. Therefore, by linearity of the  χ-metric, we have that

image

where the last equality is by noting that the corresponding integral decomposes as a product.

We now need to think about the distribution of  χU(p, q) when p and qare drawn independently from P. We note that for each bin B the quantity�BdpdqdU is either 1+ϵ2k/2 or 1−ϵ2k/2 with equal probability and independently for each bin. Therefore,

image

where  Xiare i.i.d. random variables  Xi ∈u {±1}. Therefore,

image

To bound this quantity, we use the fact that, for each  t, the tth moment of a Rademacher random variable is less than or equal to the corresponding moment of the standard Gaussian. We thus have that

image

where the  Giare i.i.d. N(0, 1) random variables. We can bound this latter quantity as follows:

image

or equivalently when  s = Ω(√k/ϵ2).This completes the proof of Proposition 3.2.

3.2 First Attempt: Basic Multidimensional Lower Bound

In this subsection, we build on the univariate construction of the previous subsection to obtain a slightly improved lower bound in d dimensions. We achieve this by modifying our ensemble in order to force any testing algorithm to guess the dimensions of the rectangles involved in the partition. Specifically, we prove the following:

Proposition 3.3. If there exists an algorithm that, given s independent samples from a k-histogram, q, on [0, 1]d with k > 4d, accepts with at least 2/3 probability if p = U and rejects with at least 2/3 probability if  dTV (p, U) ≥ ϵ, then s = Ω(ϵ−2�kd/2d log(log(k − d)/d)).

Proof. We first assume that k is a power of  2, namely k = 2m+d. Since this can always be achieved by decreasing k by a factor of at most 2, this should not affect the final bound. We define an ensemble P similarly to how we did so in the proof of Proposition 3.2. To define a distribution q in P, first we randomly and uniformly pick a  d-tuple (m1, m2, . . . , md)of non-negative integers summing to m. We call this the defining vector of q. We next divide  [0, 1]d into k/2bins by producing a �dj=1 2mjgrid We cut each bin into  2d equal sub-bins by diving it in half along each dimension. We divide these sub-bins into two classes based on their parity. We then let  dq = (1 + ϵ)dVon the sub-bins of a random parity and  dq = (1 − ϵ)dVon the other sub-bins, where the choices are independent for each bin. We note that a q drawn from P is always a k-histogram that is  ϵ-far from the uniform distribution U. An illustration is given in Figure 3.

image

Figure 3: An example of a distribution from P. The dark cells have density  1 + ϵ, and the light cells have density  1−ϵ. The green lines separate the square into a  4×2grid, and each rectangle is filled with a random 2 × 2checkerboard.

image

independent samples from q. Once again, it suffices to bound from below  χU⊗s(P⊗s, P⊗s). We similarly have that

image

We note that if p and q have the same defining vectors, then the contribution to  χU(p, q) from eachbin is randomly and independently  2(1 ± ϵ2)/k. Therefore, by the arguments of the previous subsection, if we condition on p and q having the same defining vectors, the expectation of  (χU(p, q))⊗s is at

image

most exp

�2�. On the other hand, if p and q have different defining vectors, we claim that

image

χU(p, q) = 1. In fact, we make the stronger claim that if A is the intersection of a defining bin of p and a defining bin of  q, then�AdpdqdU = q(A). This is because without loss of generality we may assume that  p’sassociated  m1is smaller than q’s associated  m1. This in turn means that given any point in A, the entire width of A along the first axis will be in the same sub-bin for q, but will pass through two sub-bins of opposite parity for p. Thus, the average of dp/dU over this line will be 1, and thus the integral over A of dpdp/dU is the same as the integral of dq.

Now since there are�m+d−1d−1 �different possible defining vectors, we have that

image

In order for this to be at least 4/3, it must be the case that

image

This completes the proof of Proposition 3.3.

image

Figure 4: An example of a probability distribution from ensemble Q. The square is divided into n = 4 regions by the black lines. Each sub-square is divided into a randomly sized grid of  2m = 8equal rectangles by the green lines. To get the final distribution, each of those rectangles should be filled with a random checkerboard as in Figure 3.

3.3 Second Attempt: Proof of Final Sample Lower Bound

Unfortunately, the lower bound of Proposition 3.3 only saves us a log log(k) factor. This is essentially because a testing algorithm only needs to correctly guess one of poly-logarithmically many defining vectors, and once it has guessed the correct one, it only needs to see a signal large enough that the probability of error is only inverse poly-logarithmic. This can be done by increasing the number of samples by only a doubly logarithmic factor. In order to do better, we will need a slightly more complicated construction, where we chop our domain into pieces and fill each piece with rectangles, but where different pieces might have rectangles of different sizes.

Theorem 3.4. If there exists an algorithm that, given s independent samples from a k-histogram,  q, on [0, 1]dwith  k > 2100d, accepts with at least 2/3 probability if q = U, and rejects with at least 2/3 probability if dTV (p, U) ≥ ϵ, then s = (√k/ϵ2) · Ω(log(k)/d)d−1.

Proof. We first assume that k can be written in the form  k = n2m+d, where n ≤�m+d−1d−1 �/4. We note that (perhaps decreasing k by a constant factor) we can achieve this with  n = Ω(log(k)/d)d, and therefore we can assume this throughout the rest of the argument.

We describe a new ensemble Q over k-histograms on  [0, 1]d in the following way: First, divide  [0, 1]dinto n equal volume boxes in some arbitrary way. For each box  Bi, pick a member  pi from P, the ensemble from the proof of Proposition 3.3, independently for different i. We let the restriction of  q to Bi be pirescaled such that it assigns  Bitotal mass 1/n, and such that the domain of definition is  Bi, rather than [0, 1]d. An example element of Q is illustrated in Figure 4.

Similarly, it suffices to show that if s is below our desired sample lower bound then

image

is less than 4/3.

We note that for p and q drawn from Q the quantity�BidpdqdU is distributed as  χU(p′, q′)/n with p′ andq′ drawn from P. This is 1/nexcept with probability  α := �m+d−1d−1 �−1and otherwise is distributed as 1/n + ϵ2n2m�2mj=1 Xij, where the  Xijare i.i.d.  {±1}random variables. Notice that these are independent for different i and sum to  χU(p, q). Therefore,

image

where the  Yiare i.i.d., equal to 1 with probability  α, and 0otherwise. Therefore, we have that

image

Once again, this expectation is only increased if the  Xijare replaced by standard Gaussians, and so  χU⊗s(Q⊗s, Q⊗s)is at most

image

with  Gii.i.d. standard normals. Noting that we still have a sum of �ni=1 Yi ∼ Binomial(n, α) manyindependent Gaussians, this simplifies to

image

In order for this to be at least 4/3, it must be the case that

image

or equivalently that

image

This completes the proof of Theorem 3.4.

Remark 3.5. We note that our lower bounds for uniformity testing of histograms on  [0, 1]d can be made to work for histograms on  [m]d, assuming that  m ≫ k. In particular, our lower bound construction requires first dividing our domain into n equal boxes, and then subdividing each of these boxes into k/n equal boxes in such a way that the number of subdivisions in each dimension is a power of 2. For simplicity, let us assume that n is a power of d. In that case, we can first cut each edge of our original box into  n1/d equalpieces and then further subdivide each side into k/n equal pieces. We note that all of the histograms in our adversarial family are consistent with this partition of our cube into fewer than  kd boxes. Therefore, by the inverse of the reduction above, our lower bound can be made to work on  [k]d rather than  [0, 1]d. A moreelaborate construction can show that our lower bounds apply for domain  [m]d for any m ≫ k.

In this work, we gave a computationally efficient and sample near-optimal algorithm for the problem of testing the identity of multidimensional histogram distributions in any fixed dimension. Our nearly matching upper and lower bounds have interesting consequences regarding the relation of learning and identity testing for this important nonparametric family of distributions.

A natural direction for future work is to generalize our results to the problem of testing equivalence between two unknown multidimensional histograms. The one-dimensional version of this problem was resolved in [DKN15a, DKN17]. Additional ideas are required for this setting, as the algorithm and analysis in this work exploit the a priori knowledge of the explicit distribution.

Another direction for future work concerns characterizing the sample and computationally complexity of identity testing d-dimensional k-histograms when the dimension d and the number of rectangles k are comparable, e.g., k = poly(d) or even k < d. We believe that understanding these parameter regimes requires different ideas.

[ADH+15]J. Acharya, I. Diakonikolas, C. Hegde, J. Li, and L. Schmidt. Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms. In 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015, pages 249–263, 2015.

[ADK15] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3591–3599, 2015.

[ADLS16] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Fast algorithms for segmented regression. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, pages 2878–2886, 2016.

[ADLS17] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1278–1289, 2017. Full version available at https://arxiv.org/abs/1506.00671.

[BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[BC17] T. Batu and C. L. Canonne. Generalized uniformity testing. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 880–889, 2017.

[BCG01] N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: A multidimensional workload-aware histogram. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 211–222, 2001.

[BFR+00]T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000.

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381–390, 2004.

[Can15] C. L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.

[Can16] C. L. Canonne. Are few bins enough: Testing histogram distributions. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, pages 455–463, 2016.

[CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, pages 25:1–25:14, 2016.

[CDKS17a] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing bayesian networks. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 370–448, 2017.

[CDKS17b] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing conditional independence of discrete distributions. CoRR, abs/1711.11560, 2017. In STOC’18.

[CDS17] C. L. Canonne, I. Diakonikolas, and A. Stewart. Fourier-based testing for families of distributions. CoRR, abs/1706.05738, 2017. In NeurIPS 2018.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013.

[CDSS14a] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. In STOC, pages 604–613, 2014.

[CDSS14b] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in near- linear time using variable-width histograms. In NIPS, pages 1844–1852, 2014.

[CDVV14] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193–1203, 2014.

[CGHJ12] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends databases, 4:1–294, 2012.

[CMN98] S. Chaudhuri, R. Motwani, and V. R. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD Conference, pages 436–447, 1998.

[DDK18] C. Daskalakis, N. Dikkala, and G. Kamath. Testing Ising models. In SODA, 2018.

[DDS12] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371–1385, 2012.

[DDS+13]C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833–1852, 2013.

[DGPP16] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.

[DGPP17] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. CoRR, abs/1708.02728, 2017. In ICALP 2018.

[DHS15] I. Diakonikolas, M. Hardt, and L. Schmidt. Differentially private learning of structured discrete distributions. In NIPS, pages 2566–2574, 2015.

[DK16] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685–694, 2016. Full version available at abs/1601.05557.

[DKN15a] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183–1202, 2015.

[DKN15b] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841–1854, 2015.

[DKN17] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Near-optimal closeness testing of discrete histogram distributions. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, pages 8:1–8:15, 2017.

[DKS16a] I. Diakonikolas, D. M. Kane, and A. Stewart. Efficient robust proper learning of log-concave distributions. CoRR, abs/1606.03077, 2016.

[DKS16b] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. CoRR, abs/1611.03473, 2016. In Proceedings of FOCS’17.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Sharp bounds for generalized uniformity testing. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, pages 6204–6213, 2018.

[DL04] L. Devroye and G. Lugosi. Bin width selection in multivariate histograms by the combinatorial method. Test, 13(1):129–145, 2004.

[DLS18] I. Diakonikolas, J. Li, and L. Schmidt. Fast and sample near-optimal algorithms for learning multidimensional histograms. In S´ebastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 819–842. PMLR, 2018.

[DP17] C. Daskalakis and Q. Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 697–703, 2017.

[FD81] D. Freedman and P. Diaconis. On the histogram as a density estimator:l2 theory. Zeitschrift f¨ur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4):453–476, 1981.

[GGI+02]A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, smallspace algorithms for approximate histogram maintenance. In STOC, pages 389–398, 2002.

[GGR98] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45:653–750, 1998.

[GJ14] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints: Estimators, Algorithms and Asymptotics. Cambridge University Press, 2014.

[GKS06] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst., 31(1):396–438, 2006.

[GKTD00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multidimensional aggregate range queries over real attributes. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 463–474, 2000.

[Gol17] O. Goldreich. Introduction to Property Testing. Forthcoming, 2017.

[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.

[ILR12] P. Indyk, R. Levi, and R. Rubinfeld. Approximating and Testing k-Histogram Distributions in Sub-linear Time. In PODS, pages 15–22, 2012.

[JKM+98]H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages 275–286, 1998.

[Kle09] J. Klemela. Multivariate histograms with data-dependent partitions. Statistica Sinica, 19(1):159–176, 2009.

[LN96] G. Lugosi and A. Nobel. Consistency of data-driven histogram methods for density estimation and classification. Ann. Statist., 24(2):687–706, 04 1996.

[Mut05] S. Muthukrishnan. Data streams: Algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2):117–236, 2005.

[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

[Pan08] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750–4755, 2008.

[Pea00] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.

[PI97] V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pages 486–495, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[RS96] R. Rubinfeld and M. Sudan. Robust characterizations of polynomials with applications to program testing. SIAM J. on Comput., 25:252–271, 1996.

[Rub12] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.

[Sco79] D. W. Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, 1979.

[Sco92] D.W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York, 1992.

[TGIK02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD Conference, pages 428–439, 2002.

[VV14] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.

[WN07] R. Willett and R. D. Nowak. Multiscale poisson intensity and density estimation. IEEE Transactions on Information Theory, 53(9):3171–3187, 2007.

We use the flattening method developed in [DK16]. We begin by giving the definition of a split distribution from that work:

Definition A.1. Given a distribution p on [n] and a multiset S of elements of [n], define the split distribution pS on [n + |S|]as follows: For  1 ≤ i ≤ n, let ai denote 1plus the number of elements of S that are equal to  i. Thus, �ni=1 ai = n + |S|.We can therefore associate the elements of [n + |S|] to elements of the set B = {(i, j) : i ∈ [n], 1 ≤ j ≤ ai}. We now define a distribution  pSwith support B, by letting a random sample from  pSbe given by (i, j), where i is drawn randomly from p and j is drawn randomly from  [ai].

We recall a basic fact about split distributions:

Fact A.2 (Fact 2.5, [DK16]). Let p and q be probability distributions on [n], and S be a given multiset of [n]. Then: (i) We can simulate a sample from  pS or qSby taking a single sample from p or q, respectively. (ii) It holds  ∥pS − qS∥1 = ∥p − q∥1.

image

Lemma A.3 ([CDVV14]). Let p and q be two unknown distributions on [n]. There exists an algorithm that on input  n, b ≥ min{∥p∥2, ∥q∥2} and 0 < ϵ <√2b, draws O(b/ϵ2)samples from each of p and q and, with probability at least 2/3, distinguishes between the cases that  p = q and ∥p − q∥2 > ϵ.

We now have all the necessary tools to describe and analyze our  ℓk1-identity tester. The pseudo-code of our algorithm follows:

image

We now provide the simple analysis. Note that  |S| ≤ �ni=1 kpi = k and that pSassigns probability mass at most 1/k to each domain element. Therefore, we have that  ∥pS∥2 ≤ 1/√k.By Lemma A.3 — applied for  b = 1/√k and ϵ/√2kin place of  ϵ— we obtain that the  ℓ2-tester in Step 2 of the above pseudo-code requires  O(b/ϵ2) = O(√k/ϵ2)samples from  qS and pS. Since pis explicitly given, so is  pSand therefore we can straightforwardly generate samples from  pSfor free. By Fact A.2, we can generate a sample from  qSgiven a sample from q. Hence, our algorithm uses  O(√k/ϵ2)samples from q. This completes the analysis of the sample complexity.

We now prove correctness. If p = q, then by Fact A.2 we have that  pS = qSand the algorithm will return “YES” with appropriate probability. On the other hand, if  ∥q − p∥1,k ≥ ϵ, then by definition of the  ℓk1metric it follows that  ∥pS − qS∥1,k+m ≥ ϵ, for m def= |S|. Since k + melements contribute to total  ℓ1-errorat least  ϵ, by the Cauchy-Schwarz inequality, we have that

image

where we used the fact that  m = |S| ≤ k. Therefore, in this case, the algorithm returns “NO” with appropriate probability. This completes the proof of Theorem 2.3.


Designed for Accessibility and to further Open Science