b

DiscoverSearch
About
My stuff
Graph-based Discriminators: Sample Complexity and Expressiveness
2019·arXiv
Abstract
Abstract

A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class H and claim that the two distributions are distinct if for some  h ∈ Hthe expected value on the two distributions is (significantly) different.

Our starting point is the following fundamental problem: ”is having the hypothesis dependent on more than a single random example bene-ficial”. To address this challenge we define k-ary based discriminators, which have a family of Boolean k-ary functions G. Each function  g ∈ Gnaturally defines a hyper-graph, indicating whether a given hyper-edge exists. A function  g ∈ Gdistinguishes between two distributions, if the expected value of g, on a k-tuple of i.i.d examples, on the two distributions is (significantly) different.

We study the expressiveness of families of k-ary functions, compared to the classical hypothesis class H, which is k = 1. We show a separation in expressiveness of k + 1-ary versus k-ary functions. This demonstrate the great benefit of having  k ≥2 as distinguishers.

For  k ≥2 we introduce a notion similar to the VC-dimension, and show that it controls the sample complexity. We proceed and provide upper and lower bounds as a function of our extended notion of VCdimension.

The task of discrimination consists of a discriminator that receives finite samples from two distributions, say  p1 and p2, and needs to certify whether the two distributions are distinct. Discrimination has a central role within the framework of Generative Adversarial Networks [12], where a discriminator trains a neural net to distinguish between samples from a real-life distribution and samples generated synthetically by another neural network, called a generator.

A possible formal setup for discrimination identifies the discriminator with some distinguishing class  D = {f : X → R} of distinguishing functions.In turn, the discriminator wishes to find the best  d ∈ Dthat distinguishes between the two distributions. Formally, she wishes to find  d ∈ D such that1

image

For examples, in GANs, the class of distinguishing functions we will consider could be the class of neural networks trained by the discriminator.

The first term in the RHS of eq. (1) is often referred to as the Integral Probability Metric (IPM distance) w.r.t a class D [17], denoted IPMD. Assuch, we can think of the discriminator as computing the IPMD distance.

Whether two, given, distributions can be distinguished by the discriminator becomes, in the IPM setup, a property of the distinguishing class. Also, the number of examples needed to be observed will depend on the class in question. Thus, if we take a large expressive class of distinguishers, the discriminator can potentially distinguish between any two distributions that are far in total variation. In that extreme, though, the class of distinguishers would need to be very large and in turn, the number of samples needed to be observed scales accordingly. One could also choose a “small” class, but at a cost of smaller distinguishing power that yields smaller IPM distance.

For example, consider two distributions over [n] to be distinguished. We could choose as a distinguishing class the class of all possible subsets over n. This distinguishing class give rise to the total variation distance, but the sample complexity turns out to be O(n). Alternatively we can consider the class of singletones: This class will induce a simple IPM distance, with graceful sample complexity, however in worst case the IPM distance can be as small as O(1/n) even though the total variation distance is large.

Thus, IPM framework initiates a study of generalization complexity where we wish to understand what is the expressive power of each class and what is its sample complexity.

For this special case that D consists of Boolean functions, the problem turns out to be closely related to the classical statistical learning setting and prediction [22]. The sample complexity (i.e., number of samples needed to be observed by the discriminator) is governed by a combinatorial measure termed VC dimension. Specifically, for the discriminator to be able to find a d as in eq. (1), she needs to observe order of Θ( ρǫ2) examples, where  ρ isthe VC dimension of the class D [5, 22].

In this work we consider a natural extension of this framework to more sophisticated discriminators: For example, consider a discriminator that observes pairs of points from the distribution and checks for collisions – such a distinguisher cannot apriori be modeled as a test of Boolean functions, as the tester measures a relation between two points and not a property of a single point. The collision test has indeed been used, in the context of synthetic data generation, to evaluate the diversity of the synthetic distribution [2].

More generally, suppose we have a class of 2-ary Boolean functions: G = {g : g(x1, x2) → {0, 1}}and the discriminator wishes to (approximately) compute

image

Here  p2 denotes the product distribution over p. More generally, we may consider k-ary mappings, but for the sake of clarity, we will restrict our attention in this introduction to k = 2. Such 2-ary Boolean mapping can be considered as graphs where  g(x1, x2) =1 symbolizes that there exists an edge between  x1 and x2and similarly g(x1, x2) = 0 denotes that there is no such edge. The collision test, for example, is modelled by a graph that contains only self–loops. We thus call such multi-ary statistical tests graph-based distinguishers. Two natural question then arise

1. Do graph–based discriminators have any added distinguishing power over classical discriminators?

2. What is the sample complexity of graph–based discriminators?

With respect to the first question we give an affirmative answer and we show a separation between the distinguishing power of graph–based discriminators and classical discriminators. As to the second question, we introduce a new combinatorial measure (termed graph VC dimension) that governs the

sample complexity of graph–based discriminators – analogously to the VC characterization of the sample complexity of classical discriminators. We next elaborate on each of these two results. As to the distinguishing power of graph–based discriminators, we give an affirmative answer in the following sense: We show that there exists a single graph g such that, for any distinguishing class D with bounded VC dimension, and  ǫ, there are two distributions  p1 and p2 that are D–indistinguishable but g certifies that  p1 and p2are distinct. Namely, the quantity in eq. (2) is at least 1/4 for G = {g}. This result may be surprising. It is indeed known that for any two distributions that are  ǫ–far in total variation, there exists a boolean mapping d that distinguishes between the two distributions. In that sense, distinguishing classes are known to be universal. Thus, asymptotically, with enough samples any two distribution can be ultimately distinguished via a standard distinguishing function. Nevertheless, our result shows that, given finite data, the restriction to classes with finite capacity is limiting, and there could be graph-based distinguishing functions whose distinguishing power is not comparable to any class with finite capacity. We stress that the same graph competes with all finite–capacity classes, irrespective of their VC dimension. With respect to the second question, we introduce a new VC-like notion termed graph VC dimension that extends naturally to graphs (and hypergraphs). On a high level, we show that for a class of graph-based distinguishers with graph VC dimension  ρ, O(ρ) examples are sufficient for discrimination and that Ω(√ρ) examples are necessary. This leaves a gap of factor √ρwhich we leave as an open question. The notion we introduce is strictly weaker than the standard VC–dimension of families of multi-ary functions, and the proofs we provide do not follow directly from classical results on learnability of finite VC classes [22, 5]. In more details, a graph-based distinguishing class G is a family of Boolean functions over the product space of vertices  V: G ⊆ {0, 1}V2. As such it is equipped with a VC dimension, the largest set of pairs of vertices that is shattered by G. It is not hard to show that finite VC is sufficient to achieve finite sample complexity bounds over 2-ary functions [9]. It turns out, though, that it is not a necessary condition: For example, one can show that the class of k-regular graphs has finite graph VC dimension but infinite VC dimension. Thus, even though they are not learnable in the standard PAC setting, they have finite sample complexity within the framework of discrimination. The reason for this gap, between learnability and discriminability, is that

learning requires uniform convergence with respect to any possible distribution over pairs, while discrimination requires uniform convergence only with respect to product distributions – formally then, it is a weaker task, and, potentially, can be performed even for classes with infinite VC dimension.

1.1 Related Work

The task of discrimination has been considered as early as the work of Vapnik and Chervonenkis in [22]. In fact, even though Vapnik and Chervonenkis original work is often referred in the context of prediction, the original work considered the question of when the empirical frequency of Boolean functions converges uniformly to the true probability over a class of functions. In that sense, this work can be considered as a natural extension to k-ary functions and generalization of the notion of VC dimension.

The work of [9, 8] studies also a generalization of VC theory to multi-ary functions in the context of ranking tasks and U-statistics. They study the standard notion of VC dimension. Specifically they consider the function class as Boolean functions over multi-tuples and the VC dimension is defined by the largest set of multi-tuples that can be shattered. Their work provides several interesting fast-rate convergence guarantees. As discussed in the introduction, our notion of capacity is weaker, and in general the results are incomparable.

GANs A more recent interest in discrimination tasks is motivated by the framework of GANs, where a neural network is trained to distinguish between two sets of data – one is real and the other is generated by another neural network called generator. Multi-ary tests have been proposed to assess the quality of GANs networks. [2] suggests birthday paradox to evaluate diversity in GANs. [19] uses Binning to assess the solution proposed by GANs.

Closer to this work [15] suggests the use of a discriminator that observes samples from the m-th product distribution. Motivated by the problem of mode collapse they suggest a theoretical framework in which they study the algorithmic benefits of such discriminators and observe that they can significantly reduce mode collapse. In contrast, our work is less concerned with the problem of mode collapse directly and we ask in general if we can boost the distinguishing power of discriminators via multi-ary discrimination. Moreover, we provide several novel sample complexity bounds.

Property Testing A related problem to ours is that of testing closeness of distributions [3, 11]. Traditionally, testing closeness of distribution is concerned with evaluating if two discrete distributions are close vs. far/identical in total variation. [11], motivated by graph expansion test, propose a collision test to verify if a certain distribution is close to uniform. Interestingly, a collision test is a graph-based discriminator which turns out to be optimal for the setting[18]. Our sample–complexity lower bounds are derived from these results. Specifically we reduce discrimination to testing uniformity [18]. Other lower bounds in the literature can be similarly used to achieve alternative (yet incomparable bounds) (e.g. [7] provides a Ω(n2/3/ǫ3/4) lowerbounds for testing whether two distributions are far or close).

In contrast with the aforementioned setup, here we do not measure distance between distributions in terms of total variation but in terms of an IPM distance induced by a class of distinguishers. The advantage of the IPM distance is that it sometimes can be estimated with limited amount of samples, while the total variation distance scales with the size of the support, which is often too large to allow estimation.

Several works do study the question of distinguishing between two distributions w.r.t a finite capacity class of tests, Specifically the work of [14] studies refutation algorithms that distinguish between noisy labels and labels that correlate with a bounded hypothesis class. [21] studies a closely related question in the context of realizable PAC learning. A graph-based discriminator can be directly turned to a refutation algorithm, and both works of [14, 21] show reductions from refutation to learning. In turn, the agnostic bounds of [14] can be harnessed to achieve lower bounds for graph-based discrimination. Unfortunately this approach leads to suboptimal lower bounds. It would be interesting to see if one can improve the guarantees for such reductions, and in turn exploit it for our setting.

2.1 Basic Notations – Graphs and HyperGraphs

Recall that a k-hypergraph g consists of a a set  Vg of verticesand a collection of non empty k–tuples over  V: Eg ⊆ Vk, which are referred to as hyperedges. If k = 2 then g is called a graph. 1–hypergraphs are simply identified as subsets over V. We will normally use d to denote such 1-hypergraphs and will refer to them as distinguishers. A distinguisher d can be identified with a Boolean function according to the rule:  d(x) = 1 iff x ∈ Ed.

Similarly we can identify a k-hypergraph with a function  g : Vk → {0, 1}.

Namely, for any graph g we identify it with the Boolean function

image

We will further simplify and assume that g is undirected, this means that for any permutation  π : [k] → [k], we have that

image

We will call undirected k-hypergraphs, k-distinguishers. A collection of k-distinguishers over a common set of vertices V will be referred to as a k-distinguishing class. If k = 1 we will simply call such a collection a distinguishing class. For k > 1 we will normally denote such a collection with G and for k = 1 we will often use the letter D.

Next, given a distribution P over vertices and a k–hypergraph g let us denote as follows the frequency of an edge w.r.t P:

image

where we use the notation  v1:tin shorthand for the sequence (v1, . . . , vt) ∈Vt, and P k denotes the product distribution of P k times.

Similarly, given a sample  S = {vi}mi=1 we denote the empirical frequency of an edge:

image

As a final set of notations: Given a k-hypergraph g a sequence  v1:n wheren < k, we define a  k − n–distinguisher  gv1:nas follows:

image

In turn, we define the following distinguishing classes: For every sequence v1:n, n < k, the distinguishing class  Gv1:nis defined as follows:

image

Finally, we point out that we will mainly be concerned with the case that  |V| ≤ ∞ or V = N. However, all the results here can be easily extended to other domains as long as certain (natural) measurability assumptions are given to ensure that VC theory holds (see [22, 4]).

2.2 IPM distance

Given a class of distinguishers D the induced IPM distance [17], denoted by IPMD, is a (pseudo)–metric between distributions over V defined as follows

image

The definition can naturally be extended to a general family of graphs, and we define:

image

Another metric we would care about is the total variation metric. Given two distributions  p1 and p2the total variation distance is defined as:

image

where  E ⊆ V{0,1} goes over all measurable events.

In contrast with an IPM distance, the total variation metric is indeed a

metric and any two distributions  p1 ̸= p2we have that TV(p1, p2) > 0. In

fact, for every distinguishing class  D, IPMD ⪯ TV.2

For finite classes of vertices V, it is known that the total variation metric

image

Further, if we let D = P(V) the power set of V we obtain

image

2.3 Discriminating Algorithms

Definition 1. Given a distinguishing class G a G-discriminating algorithm A with sample complexity  m(ǫ, δ)is an algorithm that receives as input two finite samples  S = (S1, S2)of vertices and outputs a hyper-graph  gAS ∈ Gsuch that:

If  S1, S2are drawn IID from some unknown distributions  p1, p2 respec-tively and  |S1|, |S2| > m(ǫ, δ) then w.p. (1 − δ)the algorithm’s output satis-fies:

image

The sample complexity of a class G is then given by the smallest possible sample complexity of a G-discriminating algorithm A.

image

Namely there exists a discriminating algorithm for G with sample complexity m(ǫ, δ) < ∞.

VC classes are discriminable For the case k = 1, discrimination is closely related to PAC learning. It is easy to see that a proper learning algorithm for a class D can be turned into a discriminator: Indeed, given access to samples from two distributions  p1 and p2we can provide a learner with labelled examples from a distribution p defined as follows: p(y = 1) = p(y = −1) = 12 and p(·|y = 1) = p1, and p(·|y = −1) = p2. Given access to samples from  p1 and p2we can clearly generate IID samples from the distribution p. If, in turn, we provide a learner with samples from p and it outputs a hypothesis  d ∈ Dwe have that (w.h.p):

image

One can also see that a converse relation holds, if we restrict our attention to learning balanced labels (i.e.,  p(y = 1) = p(y = −1)). Namely, given labelled examples from some balanced distribution, the output of a discriminator is a predictor that competes with the class of predictors induced by D.

Overall, the above calculation, together with Vapnik and Chervonenkis’s classical result [22] shows that classes with finite VC dimension  ρare discriminable with sample complexity  O( ρǫ2).3The necessity of finite VC dimension for agnostic PAC-learning was shown in [1]. Basically the same argument shows that given a class  D, ˜Ω( ρǫ2) examples are necessary for discrimination. We next introduce a natural extension of VC dimension to hypergraphs, which will play a similar role.

2.4 VC Dimension of hypergraphs

We next define the notion of graph VC dimension for hypergraphs, as we will later see this notion indeed characterizes the sample complexity of discriminating classes, and in that sense it is a natural extension of the notion of VC dimension for hypotheses classes:

Definition 2. Given a family of k-hypergraphs, G: The graph VC dimension of the class G, denoted gVC(G), is defined inductively as follows: For k = 1 gVC(G) is the standard notion of VC dimension, i.e., gVC(G) = VC(G). For k > 1:

image

Roughly, the graph VC dimension of a hypergraph is given by the VC dimension of the induced classes of distinguishers via projections. Namely, we can think of the VC dimension of hypergraphs as the projected VC dimension when we fix all coordinates in an edge except for one.

We next describe the main results of this work. The results are divided into two sections: For the first part we characterize the sample complexity of graph–based distinguishing class. The second part is concerned with the expressive/distinguishing power of graph–based discriminators. All proofs are provided in appendices B and C respectively.

3.1 The sample complexity of graph-based distinguishing class

We begin by providing upper bounds to the sample complexity for discrim- ination

Theorem 1 (Sample Complexity – Upper Bound). Let G be a k–distinguishing class with gVC(G) = ρ then Ghas sample complexity  O(ρk2ǫ2 log 1/δ).

Theorem 1 is a corollary of the following uniform convergence upper bound for graph-based distinguishing classes.

Theorem 2 (uniform convergence). Let G be a k–distinguishing class with gVC(G) = ρ. Let S = {vi}mi=1 be an IID sample of vertices drawn from some unknown distribution  P. If m = Ω(ρk2ǫ2 log 1/δ)then with probability at least (1  − δ)(over the randomness of S):

image

The proof of theorem 2 is given in appendix B.1. We next provide a lower bound for the sample complexity of discriminating algorithms in terms of the graph VC dimension of the class

image

We refer the reader to appendix B.2 for a proof of theorem 3. Our upper bounds and lower bounds leave a gap of order  O(√ρ). As dicussed in section 2.3, for the case k = 1 we can provide a tight  θ( ρǫ2) bound through a reduction to agnostic PAC learning and the appropriate lower bounds[1]. In general it would be interesting to improve the above bound both in terms of  ρ and k.

3.2 The expressive power of graph-based distinguishing class

So far we have characterized the discriminability of graph-based distinguishing classes. It is natural though to ask if graph–based distinguishing classes add any advantage over standard 1-distinguishing classes. In this section we provide several results that show that indeed graph provide extra expressive power over standard distinguishing classes.

We begin by providing a result over infinite graphs. (proof is provided in appendix C.1)

Theorem 4. Let V = N. There exists a distinguishing graph class G, with sample complexity  m(ǫ, δ) = O(log 1/δǫ2 ) (in fact |G| = 1) such that: for any 1-distinguishing class D with finite VC dimension, and every  ǫ > 0 there aretwo distributions  p1, p2 such that IPMD(p1, p2) < ǫ but IPMG(p1, p2) > 1/2

Theorem 4 can be generalized to higher order distinguishing classes (see appendix C.2 for a proof):

Theorem 5. Let V = N. There exists a k-distinguishing class  Gk, with sample complexity  m(ǫ, δ) = O(k2+log 1/δǫ2 )such that: For any  k−1-distinguishing class  Gk−1with bounded sample complexity, and every  ǫ > 0there are two distributions  p1, p2 such that IPMGk−1(p1, p2) < ǫ and IPMGk(p1, p2) > 1/4.

Finite Graphs We next study the expressive power of distinguishing graphs over finite domains.

It is known that, over a finite domain V = {1, . . . , n}, we can learn with a sample complexity of  O( nǫ2 log 1/δ) any distinguishing class. In fact, we can learn the total variation metric (indeed the sample complexity of P(V) is bounded by log |P(V )| = n).

Therefore if we allow classes whose sample complexity scales linearly with n we cannot hope to show any advantage for distinguishing graphs. However, in most natural problems n is considered to be very large (for example, over the Boolean cube n is exponential in the dimension). We thus, in general, would like to study classes that have better complexity in terms of n. In that sense, we can show that indeed distinguishing graphs yield extra expressive power.

In particular, we show that for classes with sublogarithmic sample complexity, we can construct graphs that are incomparable with a higher order distinguishing class.

Theorem 6. Let |V| = n. There exists a k-distinguishing class  Gk, withsample complexity  m(ǫ, δ) = O(k2+log 1/δǫ2 ) (in fact |G| = 1) such that: For any  ǫ > 0 and any k − 1distinguishing class  Gk−1 if:

image

then gVC(Gk−1) = Ω( ǫ2k2√log n).

The proof is given in appendix C.3. We can improve the bound in theorem 6 for the case k = 1 (see appendix C.4 for proof).

Theorem 7. Let |V| = n. There exists a 2-distinguishing class G, with sample complexity  m(ǫ, δ) = O(log 1/δǫ2 ) (in fact |G| = 1) such that: For any ǫ > 0and any distinguishing class D if:

image

then gVC(D) = ˜Ω(ǫ2 log n).

In this work we developed a generalization of the standard framework of discrimination to graph-based distinguishers that discriminate between two distributions by considering multi-ary tests. Several open question arise from our results:

Improving Sample Complexity Bounds In terms of sample complexity, while we give a natural upper bound of  O(ρk2), the lower bound we provide are not tight neither in d nor in k and we provide a lower bound of Ω(

of k.

Improving Expressiveness Bounds We also showed that, over finite domains, we can construct a graph that is incomparable with any class with VC dimension Ω(ǫ2 log n). The best upper bound we can provide (the VC of a class that competes with any graph) is the naive O(n) which is the VC dimension of the total variation metric.

Additionally, for the k-hypergraph case, our bounds deteriorate to a Ω(ǫ2√log n). The improvement in the graph case follows from using an argument in the spirit of Boosting [10] and Hardcore Lemma [13] to construct two indistinguishable probabilities with distinct support over a small domain. It would be interesting to extend these techniques in order to achieve similar bounds for the k > 2 case.

Relation to GANs and Extension to Online Setting Finally, a central motivation for learning the sample complexity of discriminators is in the context of GANs. It then raises interesting questions as to the foolability of graph-based distinguishers.

The work of [6] suggests a framework for studying sequential games between generators and discriminators (GAM-Fooling). In a nutshell, the GAM setting considers a sequential game between a generator G that outputs distributions and a discriminator D that has access to data from some distribution  p∗ (not known to G). At each round of the game, the generator proposes a distribution and the discriminator outputs a  d ∈ Dwhich distinguishes between the distribution of G and the true distribution  p∗. Theclass D is said to be GAM-Foolable if the generator outputs after finitely many rounds a distribution p that is D–indistinguishable from  p∗

[6] showed that a class D is GAM–foolable if and only if it has finite Littlestone dimension. We then ask, similarly, which classes of graph–based distinguishers are GAM-Foolable? A characterization of such classes can potentially lead to a natural extension of the Littlestone notion and on-line prediction, to graph-based classes analogously to this work w.r.t VC dimension

Acknowledgements The authors would like to thank Shay Moran for helpful discussions and suggesting simplifications for the proofs of theorems 4 to 6.

[1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009.

[2] Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.

[3] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 259–269. IEEE, 2000.

[4] Shai Ben-David. 2 notes on classes with vapnik-chervonenkis dimension 1. arXiv preprint arXiv:1507.05307, 2015.

[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

[6] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators. arXiv preprint arXiv:1902.03468, 2019.

[7] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193–1203. SIAM, 2014.

[8] St´ephan Cl´emen¸con, Igor Colin, and Aur´elien Bellet. Scaling-up em- pirical risk minimization: optimization of incomplete u-statistics. The Journal of Machine Learning Research, 17(1):2682–2717, 2016.

[9] St´ephan Cl´emen¸con, G´abor Lugosi, Nicolas Vayatis, et al. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844–874, 2008.

[10] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In COLT, volume 96, pages 325–332. Citeseer, 1996.

[11] Oded Goldreich and Dana Ron. On testing expansion in boundeddegree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68–75. Springer, 2011.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[13] Russell Impagliazzo. Hard-core distributions for somewhat hard prob- lems. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 538–545. IEEE, 1995.

[14] Pravesh K Kothari and Roi Livni. Agnostic learning by refuting. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[15] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1498–1507, 2018.

[16] Richard J Lipton and Neal E Young. Simple strategies for large zero- sum games with applications to complexity theory. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 734–740. ACM, 1994.

[17] Alfred M¨uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

[18] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

[19] Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information Processing Systems, pages 5847–5858, 2018.

[20] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[21] Salil P. Vadhan. On learning vs. refutation. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1835–1848, 2017.

[22] Vladimir N Vapnik and Aleksei Yakovlevich Chervonenkis. The uni- form convergence of frequencies of the appearance of events to their probabilities. In Doklady Akademii Nauk, volume 181, pages 781–783. Russian Academy of Sciences, 1968.

image

We begin with a brief overview of some classical results in Statistical Learning theory which characterizes VC classes. Throughout we assume a domain X and a hypothesis class which is a family of Boolean functions over X: H ⊆ {0, 1}X .

Theorem 8. [Within proof of Thm. 6.11 in [20]] Let H be a class with VC dimension  ρ then

image

Recall that a class H has the uniform convergence property, if for some m : (0, 1)2 → N if Pis some unknown distribution and  S = {xi}mi=1 is asample drawn IID from  P such that |S| > m(ǫ, δ) then w.p. (1  − δ) (overthe sample S):

image

The following, high probability analogue of theorem 8, is also an immediate corollary of Theorem 6.8 in [20]4:

Corollary 1. [Within Thm 6.8 [20]] Let D be a class with VC dimension ρ. There exists a constant C > 0, such that: Let p be a distribution with finite support over V. Let S be an IID sequence

of m elements drawn from p, and denote by  pSthe empirical distribution over  S. If m ≥ C ρ+log 1/δǫ2 then w.p. (1 − δ)(over the random choice of S) we have that

image

The problem of testing the closeness of two discrete distributions can be phrased as follows: Given samples from two distributions  p1 and p2 thetester needs to distinguish between the case  p1 = p2and the case that ∥p1−p2∥1 ≥ ǫ. We will rely on the following result which follows immediately from a uniformity test lower bound due to [18].

Theorem 9. Given ǫ > 0and access to samples from distributions  p1 andp2 over [n]any algorithm that returns with probability 2/3 EQUIV ALENT

if  p1 = p2and returns  DISTINCT if ∥p1 − p2∥1 > ǫmust observe at least �√n/ǫ2}�samples.

We note that [7] gives a slightly better lower bound, of an order of Ω�max(n3/4/ǫ4/3, √n/ǫ2)�. However, to simplify we will focus on rates of O(1/ǫ2) that scale quadratically in  ǫ.

image

Theorem 2 (uniform convergence). Let G be a k–distinguishing class with gVC(G) = ρ. Let S = {vi}mi=1 be an IID sample of vertices drawn from some unknown distribution  P. If m = Ω(ρk2ǫ2 log 1/δ)then with probability at least (1  − δ)(over the randomness of S):

image

Fix a k–distinguishing class G with graph VC dimension  ρ. As in the standard proof of uniform convergence for VC classes, we first prove the statement in expectation and then apply Mcdiarmid’s inequality to prove the result w.h.p. Specifically, we will use the following Lemma (whose proof is given in appendix B.1.1):

Lemma 1 (Uniform Convergence in Expectation). Let G be a k–distinguishing class with gVC(G) = ρ. Let S = {vi}mi=1 be an IID sample of vertices drawn from some unknown distribution P. Then,

image

We next proceed with the proof of theorem 2, assuming the correctness of lemma 1. Define

image

Let  S = (v1, . . . , vm) be a sample and  S′, some sequence that differ from S only in the i-th vertex then we will show that:

image

Once we show eq. (4) holds, the result indeed follow from Mcdiarmid’s inequality and lemma 1. Specifically if we assume that  m ≥ 8k2(4+ρ log(2em/ρ)ǫ2 +

image

Applying Mcdiarmid’s we obtain that with probability at least (1  − e− mǫ28k2 ),over the sample S:

image

Noting that  m > 8k2 log 1/δǫ2, we obtain that with probability at least (1  − δ)

image

We are thus left with proving that eq. (4) holds.

For an index  i and m ≥ i, let us denote by  πi,m all k-subsets of indices from {1, . . . , m} that include i and we let  π¬i,m be all k-sequences that do not include i. Given a set  S of size m let Si,+ all the k-subsets of S that include  vi and let Si,−be all the k-subsets that do not include  vi. Next,denote

image

And similarly

image

Then, let  S and S′ be two samples that differ on the i-th example. Specifi-cally assume that  vi ∈ S and v′i ∈ S′. Note that  Si,− = S′i,−. Then:

image

We are thus left with proving lemma 1:

image

The proof of the statement follows by induction. The case k = 1 is the standard uniform convergence property of VC classes, and it follows from theorem 8.

We next proceed to prove the statement for k, assuming it holds for  k−1.

We begin with the following, triangular, inequality:

image

We next bound the two terms

image

where we denoted by  USk−1the uniform distribution over  k −1-tuples from S. The expectation in the last expression is thus taken w.r.t a process where we pick m elements according to P and then partition them to  m − k + 1elements and to a sequence  v1:k−1of distinct elements. This process is equivalent to simply choosing  m − k+ 1 elements according to P, and then picking  k −1 new elements, again, according to P as follows:

image

Note that the quantity 1m� d(vi) is dependent on  Gv1:k−1, namely these are random sampled choices that depend on our choice of distinguishing class. To bound their effect we next add and subtract auxiliary random variables u1, . . . , uk−1sampled IID according to P:

image

Renaming  u1, . . . , uk−1 and v1, . . . , vk−1we can write:

image

Finally we apply. theorem 8. Recalling that gVC(Du1:k−1) = ρ, and thatthe sequence S is drawn IID independent of the choice  u1:k−1, we obtain for every fixed (u1, . . . , uk)

image

We now use the induction hypothesis: Note that  Gv is (k −1)-distinguishing class with gVC(Gv) ≤ ρfor every choice of v. Thus, fixing v:

image

Continuing the proof With the aforementioned bound on the terms * and ** we now obtain

+  ∗∗ ≤ 4 +�ρ log 2em/ρ√2m + 2km +(k − 1)�4 +�ρ log(2em/ρ)�

image

image

To prove theorem 3 we will in fact prove a stronger statement: We will show that it is not only hard to compute a  g ∈ Gas required, but in fact it is even hard to determine if such g exists vs. the case that  p1 = p2.

Specifically let us call an algorithm A a testing algorithm for G with sample complexity  m(ǫ, δ) if Areceives IID samples from two distributions  p1and  p2 of size m(ǫ, δ) and returns either EQUIV ALENT or DISTINCT such that w.p. (1  − δ):

If  p1 = p2the algorithm returns EQUIV ALENT .

If IPMG(p1, p2) > ǫthe algorithm returns DISTINCT

Theorem 10. Let G be a k–distinguishing class with gVC(G) = ρ. Anytesting algorithm A with sample complexity  m(ǫ, δ)must observe � √ρ27k3ǫ2�

image

Clearly, theorem 3 is a corollary of theorem 10. Indeed if A is a discriminating algorithm for G with sample complexity  m(ǫ, δ) we can apply it over a sample of size  m(ǫ/3, δ) to receive (w.p. 1  − δ) a graph g s.t.

image

With an additional sample of size  O(k2 log 1/δǫ2) we can estimate  |Ep1(g) −Ep2(g)|within accuracy  ǫ/3, and verify if IPMG(p1, p2) < ǫ: The test will then output  EQUIV ALENT if |Ep1(g) − Ep2(g)| < ǫ3.

To conclude, we constructed a testing algorithm with sample complexity m(ǫ, δ) + C k2 log 1/δǫ2 . Assuming ρis sufficiently large, in particular √ρ27k3 ≫

image

The proof is done by induction. For the induction, we will assume a more fine-grained lower bound. We will assume that there exists a constant C so that for every  n ≤ k − 1, if mn(ǫ, δ) is the sample complexity of a testing algorithm for an n-distinguishing class then:

image

C > 0 will depend only on the constant for the lower bound for testing if two distributions are distinct or  ǫ-far in total variation, as in theorem 9.

We start with the case k = 1. k = 1 The case k = 1 follows directly from theorem 9. Let D be a

class with VC dimension  ρ.by restricting our attention to probabilities

supported on the shattered set of size  ρ, we may assume that  |V| = ρ and

that D = P(V). Note then, that for the IPM distance we then have

image

theorem 9 immediately yields the result.

the induction stepWe now proceed with the proof assuming the statement holds for  k − 1.

By assumption gVC(G) = ρ. Fix v ∈ Vsuch that gVC(Gv) = ρ. Forevery  q ∈ (0,1) and distribution p denote

image

We next state the core Lemma we will need for the proof:

Lemma 2. Let G be a family of k-hypergraphs and  p1, p2two distributions. Assume that for some  v ∈ Vwe have that:

image

We deter the proof of lemma 2 to appendix B.2.2, and proceed with the proof of the induction step. Let us denote  δk = 2−k log k and denoteck = 2−3k2.

image

theorem 10. We can now construct a testing algorithm for  Gvwith sample complexity

image

image

We now show that if  p1 = p2the algorithm outputs w.p. (1  − δ)EQUIV ALENT : Indeed, since  p1 = p2, we have that  pq1 = pq2 for allq: Applying union bound we have that w.p. (1  − δ) the algorithm indeed outputs EQUIV ALENT .

On the other hand, if IPMGv(p1, p2) ≥ ǫwe have by lemma 2 that for one of the distributions (pq1, pq2), IPMG(pq1, pq2) > ckǫ, in particular the al- gorithm will output DISTINCT with probability (1  − δ). Overall weconstructed a testing algorithm for  Gvwith sample complexity as in eq. (7). Reparametrizing we obtain:

image

If  kδ < 2−(k−1) log(k−1), in particular  δ < 2−k log k: we obtain from the induction hypothesis that

image

and the result immediately follows.

image

Denote

image

One can show that

image

where  pg(q) is some k−2 degree polynomial in q whose coefficient depend on  g and p1 and p2. We next apply the following claim

image

Proof Sketch. We provide a full proof for this claim in appendix D.1. In a nutshell, claim 1 follows from the equivalence between norms in finite dimensional spaces. Indeed, the mapping

image

where  pa(x) = � aixi is known to be a non–singular linear transformation induced by the appropriate Vandermonde matrix (specifically.  Vi,j = ((i −1)/k))j−1). Letting λminbe the smallest singular value of the matrix V , we know that  ∥V a∥2 ≥ λmin∥a∥2. where ais the vector of coefficients of the polynomial  pa.

Finally, we exploit the relation in  Rk+1: ∥x∥∞ ≤ ∥x∥2 ≤√k + 1∥x∥∞.We can, thus, relate the max norm of the coefficient vector  ∥a∥∞ ≥ |a1| tothe maximum value maxi∈{0,...,k}� aj(i/k)j = ∥V a∥∞ to obtain

image

It remains only to lower bound the singular values of V , this is done in the full proof in appendix D.1.

With claim 1 in mind we prove the result as follows: First, suppose that for some  g ∈ Gwe have that

image

In this case, applying claim 1 with  a0 = ∆g0(p1, p2) and a1 = k (∆g0(p1, p2) − ∆g1(p1, p2))

and  p = pg, we obtain that there exists a value q = j/k such that IPMG(pq1, pq2) ≥ǫ

image

For any  g ∈ G, by assumption we have that  |∆g1(p1, p2)| > ǫ, for some g ∈ G.Hence  |∆g0(p1, p2)| > ǫ/2. By definition of ∆0we have that for q = 0 we obtain that: IPMG(pq1, pq2) = |E(pq1) − E(pq2)| > ǫ2.

image

Theorem 4. Let V = N. There exists a distinguishing graph class G, with sample complexity  m(ǫ, δ) = O(log 1/δǫ2 ) (in fact |G| = 1) such that: for any 1-distinguishing class D with finite VC dimension, and every  ǫ > 0 there aretwo distributions  p1, p2 such that IPMD(p1, p2) < ǫ but IPMG(p1, p2) > 1/2

image

to be a bipartite graph. We thus, divide the vertices into two infinite sets:  V1and  V2the elements of  V1will be indexed by  N i.e. V1 = {v1, v2, · · · } and weindex the elements of  V2with finite subsets of  N V2 = {vA : A ⊆ N, |A| < ∞}.Next we define g so that an edge passes between  vi ∈ V1 and vA ∈ V2 iffi ∈ A.

Let D be a distinguishing class with finite sample complexity, in particular gVC(D) < ∞. Denote gVC(D) = ρ. Let D1be the restriction of D to V1: Note that gVC(D1) ≤ ρ.

Next we make the following claim:

image

Proof. To construct two such distributions, choose a set  S ⊆ V1 of size mlarge enough (to be determined later). Then, randomly choose two samples S1 and S2 out of S(uniformly), each of size  O( ρǫ2). Then, by theorem 2 with some constant probability we have that IPMD(pS1, pS) < ǫ/2 and similarly IPMD(pS, pS2) < ǫ/2 . Taken together we obtain that IPMG(pS1, pS2) < ǫ.

Also, if S is sufficiently large (say, of order  O(ρ2ǫ4)), we would have that w.h.p  S1 ∩ S2 = ∅. Thus, let  q1 = pS1 and q2 = pS2.

With claim 2, we proceed with the proof. Let  q1 and q2be as in claim 2. Let A be the support of  q1, and define  p1to be a distribution  p1 = 12δA + 12q1and similarly we define  p2 = 12δA + 12q2. We then have

image

On the other hand, note that for  p1the probability to draw an edge from g is at least 1/2 (indeed if  v1 = vA and v2 ̸= vAdrawn from  q1 theng(v1, v2) = 1. On the other hand, the probability to draw an edge from  p2is 0. It follows that IPMG(p1, p2) > 12.

image

Theorem 5. Let V = N. There exists a k-distinguishing class  Gk, with sample complexity  m(ǫ, δ) = O(k2+log 1/δǫ2 )such that: For any  k−1-distinguishing class  Gk−1with bounded sample complexity, and every  ǫ > 0there are two distributions  p1, p2 such that IPMGk−1(p1, p2) < ǫ and IPMGk(p1, p2) > 1/4.

image

vertices into two infinite sets  V1 and V2. Again, the elements of  V1 will beindexed by N, and the elements of  V2are indexed by finite subsets of N. V2 = {vA : A ⊆ N, |A| < ∞}.

We define the hyper graph  gkto be a (undirected) graph that contains a hyperedge (vi1, . . . , vik−1, vA) whenever {i1 . . . , ik−1} ⊆ A.

Next, as before we construct two distributions with distinct support such that IPMG(p1, p2) ≤ ǫ. This is done similar to the proof of theorem 4. Specifically:

Claim 3. Let G be a k −1-distinguishing class defined on  V1. There are two distributions,  q1 and q2, supported on  V1 so that

image

The proof is a repetition of the proof of claim 2, where we draw  S1 andS2to be order of  O(k2ρǫ2), and again invoke theorem 2.

As before, then, given a class  G of k −1–hypergraphs we take two distributions  q1 and q2as in claim 3 and if A is the support of  q1, we takep1 = 1kδvA + (1 − 1k)q1 and let p2 = 1kδvA + (1 − 1k)q2. Then, we can show that IPMG(p1, p2) ≤ ǫ. On the other hand, the probability to draw an edge from  gk is k · 1k(1− 1k)k−1 ≥ e−1according to  p1, but the probability to draw an edge from  p2 is 0.

image

Theorem 6. Let |V| = n. There exists a k-distinguishing class  Gk, withsample complexity  m(ǫ, δ) = O(k2+log 1/δǫ2 ) (in fact |G| = 1) such that: For any  ǫ > 0 and any k − 1distinguishing class  Gk−1 if:

image

then gVC(Gk−1) = Ω( ǫ2k2√log n).

The proof is similar to the proof of theorem 5. For simplicity, let us assume that |V| = n+log n. This will not change the results up to constants.

image

and  V2. We index the elements of  V1 as {v1, . . . , vlog n}and we index the elements of  V2with subsets of [log n]. We then consider a graph g that contains only hyper-edges of the form (vi1, . . . , vk−1, vA) iff {i1, . . . , ik−1} ∈A.

image

m(ǫ, δ) = O�ρk2ǫ2�be an upper bound on the sample complexity of classes of graph VC dimension  ρ.

image

butions  q1, q2 over [log n], with disjoint support such that IPMGk(q1, q2) < ǫ.The proof is done as in claim 3.

image

{1, . . . , log n} of size m(ǫ/8, 0.99). One can show that w.p 1/4 we have that S1 ∩ S2are distinct, also we have w.p 0.98 that IPMG(pS, pS1) < ǫ/8 andsimilarly IPMG(pS, pS2) < ǫ/8. Taken together we obtain that with positive probability  q1 = pS1 and q2 = pS2have disjoint support and IPMG(q1, q2) <

image

p1 = 1kδvA + (1 − 1k)q1and similarly  p2 = 1kδvA + (1 − 1kq2. One can show that IPMG(p1, p2) < ǫ4 but the probability to draw an edge from g according to  q1is at least 1/4, while it equals 0 if we draw edges according to  p2.

image

ǫIPMG. In other words, if IPMGk ≻ ǫ · IPMG then log n ≤ m2(ǫ/8, 0.99).

image

image

Theorem 7. Let |V| = n. There exists a 2-distinguishing class G, with sample complexity  m(ǫ, δ) = O(log 1/δǫ2 ) (in fact |G| = 1) such that: For any ǫ > 0and any distinguishing class D if:

image

then gVC(D) = ˜Ω(ǫ2 log n).

The proof is similar to the proof of theorem 4 but we will use an improved upper bound on the size of S which we next state (see appendix C.5 for a proof):

Lemma 3. Let D be a class with gVC(D) = ρover a domain S. There exists a constant c > 0 (independent of D and d) such that if  |S| > c· dǫ2 log2(d/ǫ2),Then there are two distributions  q1 and q2, supported on S such that:

1.  q1, and q2have disjoint support.

2. IPMD(q1, q2) < ǫ

The graph g is constructed as in theorem 4. Let V be a set of vertices of size  n + log n, let V1be a set of size log n and we index its elements with {v1, . . . , v2, . . . , vlog n}. We let V2include all other elements and we index them via subsets of [log n]. The graph is again constructed so that  vA ∈ V2has an edge to  vi ∈ V1 iff i ∈ A. As before, we make the graph bipartite, i.e. both  V1 and V2are independent sets.

Now suppose log  n ≥ c ρǫ2 log2 dǫ2. By lemma 3 we have that there exists a set  A ⊆ {1, . . . , log n}, a distribution  p1 and p2 where p1is supported on A and  p2is supported on its compelement so that IPMG(p1, p2) < ǫ. As before we construct  q1 = δvA + (1 − δ)p1 and q2 = δvA + (1 − δ)p2. One can verify that IPMG(q1, q2) < ǫ but IPMGk+1(q1, q2) > 12. Thus, if IPMGk ≻ ǫ·IPMGk+1then log  n ≤ c ρǫ2 log2 dǫ2. In turn d = ˜Ω(ǫ2 log n).

image

D} and denote  H = H 2ǫ2 ln |S|. Note that

image

It thus follows that there exists  f /∈ H. Let fbe such and define a matrix

image

Now suppose that for some distribution q over S, for every d we have that Ev∼q[d(v) = f(v)] < 12 + 1ǫ. Then, defining  q1 = q(·|f(v) = 0) and q2 =q(·|f(v) = 1) yields the desired result. Indeed,

image

We now wish to prove that indeed, such a q exists. Suppose, otherwise: That for any distribution q over S we can find  d such that Ev∼q[d(v) =f(v)] > 12 + 1ǫ. This can be rephrased in terms of a value of a minimax game as follows:

image

Where ∆(S) denotes the set of distributions over S. It is well known ([16], thm 2), that for any game defined by any matrix M with c columns, there exists a strategy for the row player that chooses uniformly from a multiset of ln c2ǫ2and achieves  ǫ-optimiality.

image

this contradicts the fact that  f /∈ H.

image

image

Consider the Vandermonde Matrix  V ∈ Mk+1,k+1 given by Vi,j =�i−1k �j−1.Our first step will be to lower bound the smallest singular value of V . In turn, we will obtain a lower bound on the maximum value over the coordinates of the vector V a. The proof can then be derived from the identity: (V a)i =�k+1j=1 aj� i−1k �j.

Let  λ1 ≤ λ2 ≤ . . . ≤ λk+1be the singular values of V . To bound the smallest singular value,  λ1, we first observe that  λk+1– the highest singular value is bounded by k + 1. To see that  λk+1 ≤ k+ 1, observe that for any vector  ∥a∥ ≤1 we have that

image

Next, using the formula for the determinant of a Vandermonde matrix, and the relation det(V ) = � λi, we obtain:

image

Taken together we obtain

image

Finally, for any polynomial  p = � aiqi with coefficient  |a1|we have that ∥a∥2 ≥ |a1|. We thus obtain,

image


Designed for Accessibility and to further Open Science