The task of discrimination consists of a discriminator that receives finite samples from two distributions, say , and needs to certify whether the two distributions are distinct. Discrimination has a central role within the framework of Generative Adversarial Networks [12], where a discriminator trains a neural net to distinguish between samples from a real-life distribution and samples generated synthetically by another neural network, called a generator.
A possible formal setup for discrimination identifies the discriminator with some distinguishing class In turn, the discriminator wishes to find the best
that distinguishes between the two distributions. Formally, she wishes to find
For examples, in GANs, the class of distinguishing functions we will consider could be the class of neural networks trained by the discriminator.
The first term in the RHS of eq. (1) is often referred to as the Integral Probability Metric (IPM distance) w.r.t a class D [17], denoted IPMsuch, we can think of the discriminator as computing the IPM
Whether two, given, distributions can be distinguished by the discriminator becomes, in the IPM setup, a property of the distinguishing class. Also, the number of examples needed to be observed will depend on the class in question. Thus, if we take a large expressive class of distinguishers, the discriminator can potentially distinguish between any two distributions that are far in total variation. In that extreme, though, the class of distinguishers would need to be very large and in turn, the number of samples needed to be observed scales accordingly. One could also choose a “small” class, but at a cost of smaller distinguishing power that yields smaller IPM distance.
For example, consider two distributions over [n] to be distinguished. We could choose as a distinguishing class the class of all possible subsets over n. This distinguishing class give rise to the total variation distance, but the sample complexity turns out to be O(n). Alternatively we can consider the class of singletones: This class will induce a simple IPM distance, with graceful sample complexity, however in worst case the IPM distance can be as small as O(1/n) even though the total variation distance is large.
Thus, IPM framework initiates a study of generalization complexity where we wish to understand what is the expressive power of each class and what is its sample complexity.
For this special case that D consists of Boolean functions, the problem turns out to be closely related to the classical statistical learning setting and prediction [22]. The sample complexity (i.e., number of samples needed to be observed by the discriminator) is governed by a combinatorial measure termed VC dimension. Specifically, for the discriminator to be able to find a d as in eq. (1), she needs to observe order of Θ( ) examples, where
the VC dimension of the class D [5, 22].
In this work we consider a natural extension of this framework to more sophisticated discriminators: For example, consider a discriminator that observes pairs of points from the distribution and checks for collisions – such a distinguisher cannot apriori be modeled as a test of Boolean functions, as the tester measures a relation between two points and not a property of a single point. The collision test has indeed been used, in the context of synthetic data generation, to evaluate the diversity of the synthetic distribution [2].
More generally, suppose we have a class of 2-ary Boolean functions: G = and the discriminator wishes to (approximately) compute
Here denotes the product distribution over p. More generally, we may consider k-ary mappings, but for the sake of clarity, we will restrict our attention in this introduction to k = 2. Such 2-ary Boolean mapping can be considered as graphs where
1 symbolizes that there exists an edge between
and similarly
) = 0 denotes that there is no such edge. The collision test, for example, is modelled by a graph that contains only self–loops. We thus call such multi-ary statistical tests graph-based distinguishers. Two natural question then arise
1. Do graph–based discriminators have any added distinguishing power over classical discriminators?
2. What is the sample complexity of graph–based discriminators?
With respect to the first question we give an affirmative answer and we show a separation between the distinguishing power of graph–based discriminators and classical discriminators. As to the second question, we introduce a new combinatorial measure (termed graph VC dimension) that governs the
sample complexity of graph–based discriminators – analogously to the VC characterization of the sample complexity of classical discriminators. We next elaborate on each of these two results. As to the distinguishing power of graph–based discriminators, we give an affirmative answer in the following sense: We show that there exists a single graph g such that, for any distinguishing class D with bounded VC dimension, and , there are two distributions
indistinguishable but g certifies that
are distinct. Namely, the quantity in eq. (2) is at least 1/4 for G = {g}. This result may be surprising. It is indeed known that for any two distributions that are
–far in total variation, there exists a boolean mapping d that distinguishes between the two distributions. In that sense, distinguishing classes are known to be universal. Thus, asymptotically, with enough samples any two distribution can be ultimately distinguished via a standard distinguishing function. Nevertheless, our result shows that, given finite data, the restriction to classes with finite capacity is limiting, and there could be graph-based distinguishing functions whose distinguishing power is not comparable to any class with finite capacity. We stress that the same graph competes with all finite–capacity classes, irrespective of their VC dimension. With respect to the second question, we introduce a new VC-like notion termed graph VC dimension that extends naturally to graphs (and hypergraphs). On a high level, we show that for a class of graph-based distinguishers with graph VC dimension
) examples are sufficient for discrimination and that Ω(
) examples are necessary. This leaves a gap of factor
which we leave as an open question. The notion we introduce is strictly weaker than the standard VC–dimension of families of multi-ary functions, and the proofs we provide do not follow directly from classical results on learnability of finite VC classes [22, 5]. In more details, a graph-based distinguishing class G is a family of Boolean functions over the product space of vertices
. As such it is equipped with a VC dimension, the largest set of pairs of vertices that is shattered by G. It is not hard to show that finite VC is sufficient to achieve finite sample complexity bounds over 2-ary functions [9]. It turns out, though, that it is not a necessary condition: For example, one can show that the class of k-regular graphs has finite graph VC dimension but infinite VC dimension. Thus, even though they are not learnable in the standard PAC setting, they have finite sample complexity within the framework of discrimination. The reason for this gap, between learnability and discriminability, is that
learning requires uniform convergence with respect to any possible distribution over pairs, while discrimination requires uniform convergence only with respect to product distributions – formally then, it is a weaker task, and, potentially, can be performed even for classes with infinite VC dimension.
1.1 Related Work
The task of discrimination has been considered as early as the work of Vapnik and Chervonenkis in [22]. In fact, even though Vapnik and Chervonenkis original work is often referred in the context of prediction, the original work considered the question of when the empirical frequency of Boolean functions converges uniformly to the true probability over a class of functions. In that sense, this work can be considered as a natural extension to k-ary functions and generalization of the notion of VC dimension.
The work of [9, 8] studies also a generalization of VC theory to multi-ary functions in the context of ranking tasks and U-statistics. They study the standard notion of VC dimension. Specifically they consider the function class as Boolean functions over multi-tuples and the VC dimension is defined by the largest set of multi-tuples that can be shattered. Their work provides several interesting fast-rate convergence guarantees. As discussed in the introduction, our notion of capacity is weaker, and in general the results are incomparable.
GANs A more recent interest in discrimination tasks is motivated by the framework of GANs, where a neural network is trained to distinguish between two sets of data – one is real and the other is generated by another neural network called generator. Multi-ary tests have been proposed to assess the quality of GANs networks. [2] suggests birthday paradox to evaluate diversity in GANs. [19] uses Binning to assess the solution proposed by GANs.
Closer to this work [15] suggests the use of a discriminator that observes samples from the m-th product distribution. Motivated by the problem of mode collapse they suggest a theoretical framework in which they study the algorithmic benefits of such discriminators and observe that they can significantly reduce mode collapse. In contrast, our work is less concerned with the problem of mode collapse directly and we ask in general if we can boost the distinguishing power of discriminators via multi-ary discrimination. Moreover, we provide several novel sample complexity bounds.
Property Testing A related problem to ours is that of testing closeness of distributions [3, 11]. Traditionally, testing closeness of distribution is concerned with evaluating if two discrete distributions are close vs. far/identical in total variation. [11], motivated by graph expansion test, propose a collision test to verify if a certain distribution is close to uniform. Interestingly, a collision test is a graph-based discriminator which turns out to be optimal for the setting[18]. Our sample–complexity lower bounds are derived from these results. Specifically we reduce discrimination to testing uniformity [18]. Other lower bounds in the literature can be similarly used to achieve alternative (yet incomparable bounds) (e.g. [7] provides a Ω(bounds for testing whether two distributions are far or close).
In contrast with the aforementioned setup, here we do not measure distance between distributions in terms of total variation but in terms of an IPM distance induced by a class of distinguishers. The advantage of the IPM distance is that it sometimes can be estimated with limited amount of samples, while the total variation distance scales with the size of the support, which is often too large to allow estimation.
Several works do study the question of distinguishing between two distributions w.r.t a finite capacity class of tests, Specifically the work of [14] studies refutation algorithms that distinguish between noisy labels and labels that correlate with a bounded hypothesis class. [21] studies a closely related question in the context of realizable PAC learning. A graph-based discriminator can be directly turned to a refutation algorithm, and both works of [14, 21] show reductions from refutation to learning. In turn, the agnostic bounds of [14] can be harnessed to achieve lower bounds for graph-based discrimination. Unfortunately this approach leads to suboptimal lower bounds. It would be interesting to see if one can improve the guarantees for such reductions, and in turn exploit it for our setting.
2.1 Basic Notations – Graphs and HyperGraphs
Recall that a k-hypergraph g consists of a a set and a collection of non empty k–tuples over
, which are referred to as hyperedges. If k = 2 then g is called a graph. 1–hypergraphs are simply identified as subsets over V. We will normally use d to denote such 1-hypergraphs and will refer to them as distinguishers. A distinguisher d can be identified with a Boolean function according to the rule:
Similarly we can identify a k-hypergraph with a function
Namely, for any graph g we identify it with the Boolean function
We will further simplify and assume that g is undirected, this means that for any permutation ], we have that
We will call undirected k-hypergraphs, k-distinguishers. A collection of k-distinguishers over a common set of vertices V will be referred to as a k-distinguishing class. If k = 1 we will simply call such a collection a distinguishing class. For k > 1 we will normally denote such a collection with G and for k = 1 we will often use the letter D.
Next, given a distribution P over vertices and a k–hypergraph g let us denote as follows the frequency of an edge w.r.t P:
where we use the notation in shorthand for the sequence (
denotes the product distribution of P k times.
Similarly, given a sample we denote the empirical frequency of an edge:
As a final set of notations: Given a k-hypergraph g a sequence n < k, we define a
–distinguisher
as follows:
In turn, we define the following distinguishing classes: For every sequence , the distinguishing class
is defined as follows:
Finally, we point out that we will mainly be concerned with the case that . However, all the results here can be easily extended to other domains as long as certain (natural) measurability assumptions are given to ensure that VC theory holds (see [22, 4]).
2.2 IPM distance
Given a class of distinguishers D the induced IPM distance [17], denoted by IPM, is a (pseudo)–metric between distributions over V defined as follows
The definition can naturally be extended to a general family of graphs, and we define:
Another metric we would care about is the total variation metric. Given two distributions the total variation distance is defined as:
where goes over all measurable events.
In contrast with an IPM distance, the total variation metric is indeed a
metric and any two distributions we have that TV(
fact, for every distinguishing class
For finite classes of vertices V, it is known that the total variation metric
Further, if we let D = P(V) the power set of V we obtain
2.3 Discriminating Algorithms
Definition 1. Given a distinguishing class G a G-discriminating algorithm A with sample complexity is an algorithm that receives as input two finite samples
of vertices and outputs a hyper-graph
such that:
If are drawn IID from some unknown distributions
tively and
the algorithm’s output satis-fies:
The sample complexity of a class G is then given by the smallest possible sample complexity of a G-discriminating algorithm A.
Namely there exists a discriminating algorithm for G with sample complexity
VC classes are discriminable For the case k = 1, discrimination is closely related to PAC learning. It is easy to see that a proper learning algorithm for a class D can be turned into a discriminator: Indeed, given access to samples from two distributions we can provide a learner with labelled examples from a distribution p defined as follows: p(y = 1) =
. Given access to samples from
we can clearly generate IID samples from the distribution p. If, in turn, we provide a learner with samples from p and it outputs a hypothesis
we have that (w.h.p):
One can also see that a converse relation holds, if we restrict our attention to learning balanced labels (i.e., 1)). Namely, given labelled examples from some balanced distribution, the output of a discriminator is a predictor that competes with the class of predictors induced by D.
Overall, the above calculation, together with Vapnik and Chervonenkis’s classical result [22] shows that classes with finite VC dimension are discriminable with sample complexity
The necessity of finite VC dimension for agnostic PAC-learning was shown in [1]. Basically the same argument shows that given a class
) examples are necessary for discrimination. We next introduce a natural extension of VC dimension to hypergraphs, which will play a similar role.
2.4 VC Dimension of hypergraphs
We next define the notion of graph VC dimension for hypergraphs, as we will later see this notion indeed characterizes the sample complexity of discriminating classes, and in that sense it is a natural extension of the notion of VC dimension for hypotheses classes:
Definition 2. Given a family of k-hypergraphs, G: The graph VC dimension of the class G, denoted gVC(G), is defined inductively as follows: For k = 1 gVC(G) is the standard notion of VC dimension, i.e., gVC(G) = VC(G). For k > 1:
Roughly, the graph VC dimension of a hypergraph is given by the VC dimension of the induced classes of distinguishers via projections. Namely, we can think of the VC dimension of hypergraphs as the projected VC dimension when we fix all coordinates in an edge except for one.
We next describe the main results of this work. The results are divided into two sections: For the first part we characterize the sample complexity of graph–based distinguishing class. The second part is concerned with the expressive/distinguishing power of graph–based discriminators. All proofs are provided in appendices B and C respectively.
3.1 The sample complexity of graph-based distinguishing class
We begin by providing upper bounds to the sample complexity for discrim- ination
Theorem 1 (Sample Complexity – Upper Bound). Let G be a k–distinguishing class with gVC(has sample complexity
Theorem 1 is a corollary of the following uniform convergence upper bound for graph-based distinguishing classes.
Theorem 2 (uniform convergence). Let G be a k–distinguishing class with gVC(be an IID sample of vertices drawn from some unknown distribution
then with probability at least (1
(over the randomness of S):
The proof of theorem 2 is given in appendix B.1. We next provide a lower bound for the sample complexity of discriminating algorithms in terms of the graph VC dimension of the class
We refer the reader to appendix B.2 for a proof of theorem 3. Our upper bounds and lower bounds leave a gap of order ). As dicussed in section 2.3, for the case k = 1 we can provide a tight
) bound through a reduction to agnostic PAC learning and the appropriate lower bounds[1]. In general it would be interesting to improve the above bound both in terms of
3.2 The expressive power of graph-based distinguishing class
So far we have characterized the discriminability of graph-based distinguishing classes. It is natural though to ask if graph–based distinguishing classes add any advantage over standard 1-distinguishing classes. In this section we provide several results that show that indeed graph provide extra expressive power over standard distinguishing classes.
We begin by providing a result over infinite graphs. (proof is provided in appendix C.1)
Theorem 4. Let V = N. There exists a distinguishing graph class G, with sample complexity ) such that: for any 1-distinguishing class D with finite VC dimension, and every
two distributions
Theorem 4 can be generalized to higher order distinguishing classes (see appendix C.2 for a proof):
Theorem 5. Let V = N. There exists a k-distinguishing class , with sample complexity
such that: For any
-distinguishing class
with bounded sample complexity, and every
there are two distributions
Finite Graphs We next study the expressive power of distinguishing graphs over finite domains.
It is known that, over a finite domain V = {1, . . . , n}, we can learn with a sample complexity of ) any distinguishing class. In fact, we can learn the total variation metric (indeed the sample complexity of P(V) is bounded by log |P(V )| = n).
Therefore if we allow classes whose sample complexity scales linearly with n we cannot hope to show any advantage for distinguishing graphs. However, in most natural problems n is considered to be very large (for example, over the Boolean cube n is exponential in the dimension). We thus, in general, would like to study classes that have better complexity in terms of n. In that sense, we can show that indeed distinguishing graphs yield extra expressive power.
In particular, we show that for classes with sublogarithmic sample complexity, we can construct graphs that are incomparable with a higher order distinguishing class.
Theorem 6. Let |V| = n. There exists a k-distinguishing class sample complexity
) such that: For any
distinguishing class
The proof is given in appendix C.3. We can improve the bound in theorem 6 for the case k = 1 (see appendix C.4 for proof).
Theorem 7. Let |V| = n. There exists a 2-distinguishing class G, with sample complexity ) such that: For any
and any distinguishing class D if:
In this work we developed a generalization of the standard framework of discrimination to graph-based distinguishers that discriminate between two distributions by considering multi-ary tests. Several open question arise from our results:
Improving Sample Complexity Bounds In terms of sample complexity, while we give a natural upper bound of ), the lower bound we provide are not tight neither in d nor in k and we provide a lower bound of Ω(
of k.
Improving Expressiveness Bounds We also showed that, over finite domains, we can construct a graph that is incomparable with any class with VC dimension Ω(). The best upper bound we can provide (the VC of a class that competes with any graph) is the naive O(n) which is the VC dimension of the total variation metric.
Additionally, for the k-hypergraph case, our bounds deteriorate to a Ω(). The improvement in the graph case follows from using an argument in the spirit of Boosting [10] and Hardcore Lemma [13] to construct two indistinguishable probabilities with distinct support over a small domain. It would be interesting to extend these techniques in order to achieve similar bounds for the k > 2 case.
Relation to GANs and Extension to Online Setting Finally, a central motivation for learning the sample complexity of discriminators is in the context of GANs. It then raises interesting questions as to the foolability of graph-based distinguishers.
The work of [6] suggests a framework for studying sequential games between generators and discriminators (GAM-Fooling). In a nutshell, the GAM setting considers a sequential game between a generator G that outputs distributions and a discriminator D that has access to data from some distribution (not known to G). At each round of the game, the generator proposes a distribution and the discriminator outputs a
which distinguishes between the distribution of G and the true distribution
class D is said to be GAM-Foolable if the generator outputs after finitely many rounds a distribution p that is D–indistinguishable from
[6] showed that a class D is GAM–foolable if and only if it has finite Littlestone dimension. We then ask, similarly, which classes of graph–based distinguishers are GAM-Foolable? A characterization of such classes can potentially lead to a natural extension of the Littlestone notion and on-line prediction, to graph-based classes analogously to this work w.r.t VC dimension
Acknowledgements The authors would like to thank Shay Moran for helpful discussions and suggesting simplifications for the proofs of theorems 4 to 6.
[1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009.
[2] Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
[3] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 259–269. IEEE, 2000.
[4] Shai Ben-David. 2 notes on classes with vapnik-chervonenkis dimension 1. arXiv preprint arXiv:1507.05307, 2015.
[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
[6] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators. arXiv preprint arXiv:1902.03468, 2019.
[7] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193–1203. SIAM, 2014.
[8] St´ephan Cl´emen¸con, Igor Colin, and Aur´elien Bellet. Scaling-up em- pirical risk minimization: optimization of incomplete u-statistics. The Journal of Machine Learning Research, 17(1):2682–2717, 2016.
[9] St´ephan Cl´emen¸con, G´abor Lugosi, Nicolas Vayatis, et al. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844–874, 2008.
[10] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In COLT, volume 96, pages 325–332. Citeseer, 1996.
[11] Oded Goldreich and Dana Ron. On testing expansion in boundeddegree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68–75. Springer, 2011.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[13] Russell Impagliazzo. Hard-core distributions for somewhat hard prob- lems. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 538–545. IEEE, 1995.
[14] Pravesh K Kothari and Roi Livni. Agnostic learning by refuting. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
[15] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1498–1507, 2018.
[16] Richard J Lipton and Neal E Young. Simple strategies for large zero- sum games with applications to complexity theory. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 734–740. ACM, 1994.
[17] Alfred M¨uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
[18] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
[19] Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information Processing Systems, pages 5847–5858, 2018.
[20] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
[21] Salil P. Vadhan. On learning vs. refutation. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1835–1848, 2017.
[22] Vladimir N Vapnik and Aleksei Yakovlevich Chervonenkis. The uni- form convergence of frequencies of the appearance of events to their probabilities. In Doklady Akademii Nauk, volume 181, pages 781–783. Russian Academy of Sciences, 1968.
We begin with a brief overview of some classical results in Statistical Learning theory which characterizes VC classes. Throughout we assume a domain X and a hypothesis class which is a family of Boolean functions over X:
Theorem 8. [Within proof of Thm. 6.11 in [20]] Let H be a class with VC dimension
Recall that a class H has the uniform convergence property, if for some is some unknown distribution and
sample drawn IID from
) then w.p. (1
the sample S):
The following, high probability analogue of theorem 8, is also an immediate corollary of Theorem 6.8 in [20]
Corollary 1. [Within Thm 6.8 [20]] Let D be a class with VC dimension . There exists a constant C > 0, such that: Let p be a distribution with finite support over V. Let S be an IID sequence
of m elements drawn from p, and denote by the empirical distribution over
(over the random choice of S) we have that
The problem of testing the closeness of two discrete distributions can be phrased as follows: Given samples from two distributions tester needs to distinguish between the case
and the case that
. We will rely on the following result which follows immediately from a uniformity test lower bound due to [18].
and access to samples from distributions
any algorithm that returns with probability 2/3 EQUIV ALENT
if and returns
must observe at least Ω
We note that [7] gives a slightly better lower bound, of an order of Ω. However, to simplify we will focus on rates of
) that scale quadratically in
Theorem 2 (uniform convergence). Let G be a k–distinguishing class with gVC(be an IID sample of vertices drawn from some unknown distribution
then with probability at least (1
(over the randomness of S):
Fix a k–distinguishing class G with graph VC dimension . As in the standard proof of uniform convergence for VC classes, we first prove the statement in expectation and then apply Mcdiarmid’s inequality to prove the result w.h.p. Specifically, we will use the following Lemma (whose proof is given in appendix B.1.1):
Lemma 1 (Uniform Convergence in Expectation). Let G be a k–distinguishing class with gVC(be an IID sample of vertices drawn from some unknown distribution P. Then,
We next proceed with the proof of theorem 2, assuming the correctness of lemma 1. Define
Let ) be a sample and
, some sequence that differ from S only in the i-th vertex then we will show that:
Once we show eq. (4) holds, the result indeed follow from Mcdiarmid’s inequality and lemma 1. Specifically if we assume that
Applying Mcdiarmid’s we obtain that with probability at least (1 over the sample S:
Noting that , we obtain that with probability at least (1
We are thus left with proving that eq. (4) holds.
For an index , let us denote by
-subsets of indices from {1, . . . , m} that include i and we let
-sequences that do not include i. Given a set
-subsets of S that include
be all the k-subsets that do not include
denote
And similarly
Then, let be two samples that differ on the i-th example. Specifi-cally assume that
. Note that
We are thus left with proving lemma 1:
The proof of the statement follows by induction. The case k = 1 is the standard uniform convergence property of VC classes, and it follows from theorem 8.
We next proceed to prove the statement for k, assuming it holds for
We begin with the following, triangular, inequality:
We next bound the two terms
where we denoted by the uniform distribution over
1-tuples from S. The expectation in the last expression is thus taken w.r.t a process where we pick m elements according to P and then partition them to
elements and to a sequence
of distinct elements. This process is equivalent to simply choosing
+ 1 elements according to P, and then picking
1 new elements, again, according to P as follows:
Note that the quantity ) is dependent on
, namely these are random sampled choices that depend on our choice of distinguishing class. To bound their effect we next add and subtract auxiliary random variables
sampled IID according to P:
Renaming we can write:
Finally we apply. theorem 8. Recalling that gVC(the sequence S is drawn IID independent of the choice
, we obtain for every fixed (
We now use the induction hypothesis: Note that 1)-distinguishing class with gVC(
for every choice of v. Thus, fixing v:
Continuing the proof With the aforementioned bound on the terms * and ** we now obtain
∗ +
To prove theorem 3 we will in fact prove a stronger statement: We will show that it is not only hard to compute a as required, but in fact it is even hard to determine if such g exists vs. the case that
Specifically let us call an algorithm A a testing algorithm for G with sample complexity receives IID samples from two distributions
and
) and returns either EQUIV ALENT or DISTINCT such that w.p. (1
• If the algorithm returns EQUIV ALENT .
• If IPMthe algorithm returns DISTINCT
Theorem 10. Let G be a k–distinguishing class with gVC(testing algorithm A with sample complexity
must observe Ω
Clearly, theorem 3 is a corollary of theorem 10. Indeed if A is a discriminating algorithm for G with sample complexity ) we can apply it over a sample of size
) to receive (w.p. 1
With an additional sample of size ) we can estimate
within accuracy
3, and verify if IPM
: The test will then output
To conclude, we constructed a testing algorithm with sample complexity is sufficiently large, in particular
The proof is done by induction. For the induction, we will assume a more fine-grained lower bound. We will assume that there exists a constant C so that for every ) is the sample complexity of a testing algorithm for an n-distinguishing class then:
C > 0 will depend only on the constant for the lower bound for testing if two distributions are distinct or -far in total variation, as in theorem 9.
We start with the case k = 1. k = 1 The case k = 1 follows directly from theorem 9. Let D be a
class with VC dimension by restricting our attention to probabilities
supported on the shattered set of size , we may assume that
that D = P(V). Note then, that for the IPM distance we then have
theorem 9 immediately yields the result.
the induction stepWe now proceed with the proof assuming the statement holds for
By assumption gVC(such that gVC(
every
1) and distribution p denote
We next state the core Lemma we will need for the proof:
Lemma 2. Let G be a family of k-hypergraphs and two distributions. Assume that for some
we have that:
We deter the proof of lemma 2 to appendix B.2.2, and proceed with the proof of the induction step. Let us denote
theorem 10. We can now construct a testing algorithm for with sample complexity
We now show that if the algorithm outputs w.p. (1
EQUIV ALENT : Indeed, since
, we have that
q: Applying union bound we have that w.p. (1
) the algorithm indeed outputs EQUIV ALENT .
On the other hand, if IPMwe have by lemma 2 that for one of the distributions (
, in particular the al- gorithm will output DISTINCT with probability (1
constructed a testing algorithm for
with sample complexity as in eq. (7). Reparametrizing we obtain:
If , in particular
: we obtain from the induction hypothesis that
and the result immediately follows.
Denote
One can show that
where 2 degree polynomial in q whose coefficient depend on
. We next apply the following claim
Proof Sketch. We provide a full proof for this claim in appendix D.1. In a nutshell, claim 1 follows from the equivalence between norms in finite dimensional spaces. Indeed, the mapping
where is known to be a non–singular linear transformation induced by the appropriate Vandermonde matrix (specifically.
1)
be the smallest singular value of the matrix V , we know that
is the vector of coefficients of the polynomial
Finally, we exploit the relation in We can, thus, relate the max norm of the coefficient vector
the maximum value max
It remains only to lower bound the singular values of V , this is done in the full proof in appendix D.1.
With claim 1 in mind we prove the result as follows: First, suppose that for some we have that
In this case, applying claim 1 with
and , we obtain that there exists a value q = j/k such that IPM
For any , by assumption we have that
Hence
2. By definition of ∆
we have that for q = 0 we obtain that: IPM
Theorem 4. Let V = N. There exists a distinguishing graph class G, with sample complexity ) such that: for any 1-distinguishing class D with finite VC dimension, and every
two distributions
to be a bipartite graph. We thus, divide the vertices into two infinite sets: and
the elements of
will be indexed by
index the elements of
with finite subsets of
Next we define g so that an edge passes between
Let D be a distinguishing class with finite sample complexity, in particular gVC(. Denote gVC(
be the restriction of D to
: Note that gVC(
Next we make the following claim:
Proof. To construct two such distributions, choose a set large enough (to be determined later). Then, randomly choose two samples
(uniformly), each of size
). Then, by theorem 2 with some constant probability we have that IPM
2 and similarly IPM
2 . Taken together we obtain that IPM
Also, if S is sufficiently large (say, of order )), we would have that w.h.p
. Thus, let
With claim 2, we proceed with the proof. Let be as in claim 2. Let A be the support of
, and define
to be a distribution
and similarly we define
. We then have
On the other hand, note that for the probability to draw an edge from g is at least 1/2 (indeed if
drawn from
) = 1. On the other hand, the probability to draw an edge from
is 0. It follows that IPM
Theorem 5. Let V = N. There exists a k-distinguishing class , with sample complexity
such that: For any
-distinguishing class
with bounded sample complexity, and every
there are two distributions
vertices into two infinite sets . Again, the elements of
indexed by N, and the elements of
are indexed by finite subsets of N.
We define the hyper graph to be a (undirected) graph that contains a hyperedge (
Next, as before we construct two distributions with distinct support such that IPM. This is done similar to the proof of theorem 4. Specifically:
-distinguishing class defined on
. There are two distributions,
, supported on
The proof is a repetition of the proof of claim 2, where we draw to be order of
), and again invoke theorem 2.
As before, then, given a class 1–hypergraphs we take two distributions
as in claim 3 and if A is the support of
. Then, we can show that IPM
. On the other hand, the probability to draw an edge from
according to
, but the probability to draw an edge from
Theorem 6. Let |V| = n. There exists a k-distinguishing class sample complexity
) such that: For any
distinguishing class
The proof is similar to the proof of theorem 5. For simplicity, let us assume that |V| = n+log n. This will not change the results up to constants.
and . We index the elements of
and we index the elements of
with subsets of [log n]. We then consider a graph g that contains only hyper-edges of the form (
A.
be an upper bound on the sample complexity of classes of graph VC dimension
butions ], with disjoint support such that IPM
The proof is done as in claim 3.
99). One can show that w.p 1/4 we have that
are distinct, also we have w.p 0.98 that IPM
similarly IPM
8. Taken together we obtain that with positive probability
have disjoint support and IPM
and similarly
. One can show that IPM
but the probability to draw an edge from g according to
is at least 1/4, while it equals 0 if we draw edges according to
. In other words, if IPM
Theorem 7. Let |V| = n. There exists a 2-distinguishing class G, with sample complexity ) such that: For any
and any distinguishing class D if:
The proof is similar to the proof of theorem 4 but we will use an improved upper bound on the size of S which we next state (see appendix C.5 for a proof):
Lemma 3. Let D be a class with gVC(over a domain S. There exists a constant c > 0 (independent of D and d) such that if
Then there are two distributions
, supported on S such that:
1. have disjoint support.
The graph g is constructed as in theorem 4. Let V be a set of vertices of size be a set of size log n and we index its elements with
include all other elements and we index them via subsets of [log n]. The graph is again constructed so that
has an edge to
. As before, we make the graph bipartite, i.e. both
are independent sets.
Now suppose log . By lemma 3 we have that there exists a set
, a distribution
is supported on A and
is supported on its compelement so that IPM
. As before we construct
. One can verify that IPM
. Thus, if IPM
then log
D} and denote . Note that
It thus follows that there exists be such and define a matrix
Now suppose that for some distribution q over S, for every d we have that . Then, defining
) = 1) yields the desired result. Indeed,
We now wish to prove that indeed, such a q exists. Suppose, otherwise: That for any distribution q over S we can find . This can be rephrased in terms of a value of a minimax game as follows:
Where ∆(S) denotes the set of distributions over S. It is well known ([16], thm 2), that for any game defined by any matrix M with c columns, there exists a strategy for the row player that chooses uniformly from a multiset of and achieves
-optimiality.
this contradicts the fact that
Consider the Vandermonde Matrix Our first step will be to lower bound the smallest singular value of V . In turn, we will obtain a lower bound on the maximum value over the coordinates of the vector V a. The proof can then be derived from the identity: (
Let be the singular values of V . To bound the smallest singular value,
, we first observe that
– the highest singular value is bounded by k + 1. To see that
+ 1, observe that for any vector
1 we have that
Next, using the formula for the determinant of a Vandermonde matrix, and the relation det(, we obtain:
Taken together we obtain
Finally, for any polynomial with coefficient
we have that
. We thus obtain,