Generalization and compression/simplification are two basic facets of “learning”. Generalization concerns the expansion of existing knowledge and compression concerns simplifying our explanations of it. In machine learning, compression and generalization are deeply related: learning algorithms perform compression and the ability to compress guarantees good generalization.
A simple form of this connection is how Occam’s Razor (Blumer et al., 1987) is manifested in Machine Learning: if the input sample can be compressed to a small number of bits which encodes a hypothesis consistent with the input sample, then good generalization of this hypothesis is guaranteed. A more sophisticated notion of compression is given by ”sample compression schemes” (Littlestone and Warmuth, 1986). In these schemes the input sample is compressed to a carefully chosen small subsample that encodes a hypothesis consistent with the input sample. For example support vector machine can be seen as compressing the original sample to the subset of support vectors which represent a maximum margin hyperplane that is consistent with the entire original sample.
What is the connection to generalization? In the Occam’s razor setting, the generalization error decreases with the number of bits that are used to encode the output hypothesis. Similarly for compression schemes, the generalization error decreases with the sample size.
In the learning model considered here, the learner is given a sample consistent with an unknown concept from a target concept class. From the given sample, the learner aims to construct a hypothesis that yields a good generalization i.e. a good approximation of the unknown concept. A core question is what parameter of the concept class characterizes the sample size required for good generalization? The Vapnik-Chervonenkis (VC) dimension serves as such a parameter (Blumer et al., 1989) where the exact definition of generalization underlying our discussion is specified by the Probably Approximately Correct (PAC) model of learning (Valiant, 1984; Vapnik and Chervonenkis, 1971). We believe that the size of the best compression scheme is an alternate parameter and has several additional advantages:
• Compression schemes frame many natural algorithms (e.g. support vector machines). This gives sample compression schemes a constructive flavor.
• Unlike the VC dimension, the definition of sample compression schemes as well as the fact that they yield low generalization error extends naturally to multi label concept classes (Samei et al., 2014). This is particularly interesting when the number of labels is very large (or possibly infinite), because for that case there is no known combinatorial parameter that characterizes the sample complexity in the PAC model (See (Daniely and Shalev-Shwartz, 2014)). The size of the best sample compression scheme is therefore a natural candidate for a universal parameter that characterizes the sample complexity in the PAC model.
Previous work
Littlestone and Warmuth (1986) defined sample compression schemes and showed that in the PAC model of learning, the sample size required for learning grows linearly with the size of the subsamples the scheme compresses to. They have also posed the other direction as an open question: Does every concept class have a compression scheme of size depending only on its VC dimension? Later Floyd and Warmuth (1995) and Warmuth (2003), refined this question: Does every class of VC dimension d have a sample compression scheme of size O(d).
Ben-David and Litman (1998) proved a compactness theorem for sample compression schemes. It essentially says that existence of compression schemes for infinite classes followsfrom the existence of such schemes for finite classes. Thus, it suffices to consider only finite concept classes. Floyd and Warmuth (1995) constructed sample compression schemes of size log |C| for every concept class C. More recently Moran et al. (2015) have constructed sample compression schemes of size exp(d) log log |C| where d = V Cdim(C). Finally, Moran and Yehudayoff (2016) have constructed sample compression scheme of size exp(d), resolving Littlestone and Warmuth’s question. Their compression scheme is based on an earlier compression scheme by Freund (1995); Freund and Schapire (2012) which was defined in the context of boosting. This sample compression scheme is of variable size: It compresses samples of size m to subsamples of size O(d log m).
For many natural and important families of concept classes, sample compression schemes of size equal the VC dimension were constructed, revealing connections between sample compression schemes and other fields such as combinatorics, geometry, model theory, and algebraic topology (e.g. Floyd (1989); Helmbold et al. (1992); Ben-David and Litman (1998); Chernikov and Simon (2013); Rubinstein et al. (2009); Rubinstein and Rubinstein (2012); Livni and Simon (2013)). Despite this rich body of work, the refined question whether there exists a compression scheme whose size is equal or linear in the VC dimension remains open.
Floyd and Warmuth (1995) observed that in order to prove the conjecture it suffices to consider only maximal classes (A class C is maximal if no concept can be added without increasing the VC dimension). Furthermore, they constructed sample compression schemes of size d for every maximum class of VC dimension d. These classes are maximum in the sense that their size equals an upper bound (due to Sauer-Shelah) on the size of any concept class of VC dimension d. Later, Kuzmin and Warmuth (2007) and Rubinstein and Rubinstein (2012) provided even more efficient and combinatorially elegant sample compression schemes for maximum classes that are called unlabeled compression schemes because the labels of the subsample are not needed to encode the output hypothesis.
One possibility of making a progress on Floyd and Warmuth’s question is by extending the optimal compression schemes for maximum classes to a more general family. In this paper we consider a natural and rich generalization of maximum classes which are known by the name extremal classes (or shattering extremal classes). Similar to maximum classes, these classes are defined when a certain inequality known as The Sandwich Theorem is tight. This inequality generalizaes the Sauer-Shelah bound. The Sandwich Theorem as well as extremal classes were discovered several times and independently by several groups of researchers and in several contexts such as Functional analysis (Pajor, 1985), Discretegeometry (Lawrence, 1983), Phylogenetic Combinatorics (Dress, 1997; Bandelt et al., 2006) and Extremal Combinatorics (Bollob´as et al., 1989; Bollob´as and Radcliffe, 1995). Even though a lot of knowledge regarding the structure of extremal classes has been accumulated, the understanding of these classes is still considered incomplete by several authors (Bollob´as and Radcliffe, 1995; Greco, 1998; R´onyai and M´esz´aros, 2011).
Our results
Our main result is a construction of sample compression scheme of size d for every extremal class of VC dimension d. When the concept class is maximum, then our scheme specializes to the compression scheme for maximum classes given in (Floyd and Warmuth, 1995). Our generalized sample compression scheme for extremal classes is still easy to describe. However its analysis requires more combinatorics and heavily exploits the rich structure of extremal classes. Despite being more general, the construction is simple. We also give explicit examples of maximal classes that are extremal but not maximum (see Example 5). This means that the compression scheme presented here is not implied by the previous sample compression schemes for maximum classes, not even implicitly.
We also discuss a certain greedy peeling method for producing an unlabeled compressions scheme. Such schemes were first conjectured in (Kuzmin and Warmuth, 2007) and later proven to exist for maximum classes (Rubinstein and Rubinstein, 2012). However the existence of such schemes for extremal classes remains open. We relate the existence of such schemes to basic open questions concerning the combinatorial structure of extremal classes.
Organization
In Section 2 we give some preliminary definitions and define extremal classes. We also discuss some basic properties and give some examples of extremal classes which demonstrate their generality over maximum classes. In Section 3 we give a labeled compression scheme for any extremal class of VC dimension d. Finally, in Section 4 we relate unlabeled compression schemes for extremal classes with basic open questions concerning extremal classes.
2.1 Preliminaries
Concepts, concept classes, and the one-inclusion graph. A concept c is a mapping from some domain to {0, 1}. We assume for the sake of simplicity that the domain of c (denoted by dom(c)) is finite and allow the case that . A concept c can also be viewed as a characteristic function of a subset of dom(c), i.e for any domain point
. A concept class C is a set of concepts with the same domain (denoted by dom(C)). A concept class can be represented by a binary table (see Fig. 1), where the rows correspond to concepts and the columns to the elements of dom(C). Whenever the elements in dom(C) are clear from the context, then we represent concepts as bit strings of length |dom(C)| (See Fig. 1).
The concept class C can also be represented as a subgraph of the Boolean hypercube with |dom(C)| dimensions. Each dimension corresponds to a particular domain element, the vertices are the concepts in C and two concepts are connected with an edge if they disagree on the label of a single element (Hamming distance 1). This graph is called the one-inclusion graph of C (Haussler et al., 1994). Note that each edge is naturally labeled by the single dimension/element on which the incident concepts disagree (See Fig. 1).
Figure 1: Table and one-inclusion graph of an extremal class C of VC dimension 2. The
Restrictions and samples. We denote the restriction/sample of a concept dom(c) as c|S. This concept has the restricted domain S and labels this domain consistently with c. Essentially concept c|S is obtained by removing from row c in the table all columns not in S. The restriction/set of samples of an entire class
) is denoted as C|S. A table for C|S is produced by simply removing all columns not in S from the table for C and collapsing identical rows.
Also the one-inclusion graph for the restriction C|S is now a subgraph of the Boolean hypercube with |S| dimensions instead of the full dimension |dom(C)|. We also use
as shorthand for C|(dom(C) \ S) (since the columns labeled with S are removed from the table). Note that the sub domain
) induces an equivalence class on C: Two concepts
are equivalent iff
. Thus there is one equivalence class per concept of C|S.
Cubes. A concept class B is called a cube if for some subset S of the domain dom(B), the restriction B|S is the set of all 2concepts over the domain S and the class
contains a single concept. We denote this single concept by tag(B). In this case, we say that S is the dimension set of B (denoted as dim(B)). For example, if B contains two concepts that are incident to an edge labeled x then B is a cube with dim(B) = {x}. We say that B is a cube of concept class C if B is a cube that is a subset of C. We say that B is a maximal cube of C if there exists no other cube of C which strictly contains B. When the dimensions are clear from the context, then a concept is described as a bit string of length dom(C). Similarly a cube, B, is described as an expression in
the dimensions of dim(B) are the *’s and the remaining bits is the concept tag(B).
Reductions. In addition to the restriction it is common to define a second operation on concept classes. We will describe this operation using cubes. The concept class on the domain dom(C) \ S which has one concept per cube with dimensions set S
The reduction with respect to a single dimension x is denoted as . See Fig. 1 for some examples.
Shattering and strong shattering. There are two important properties associated with subsets S of the domain of C. We say that the set of all 2
concepts over the domain S. Furthermore, S is strongly shattered by C, if C has a cube with dimensions set S. We use s(C) to denote all shattered sets of C and st(C) to denote all strongly shattered sets, respectively. Clearly, both s(C) and st(C) are closed under the subset relation, and
The following theorem is the result of accumulated work by different authors, and parts of it were rediscovered independently several times (Pajor, 1985; Bollob´as and Radcliffe, 1995; Dress, 1997; Anstee et al., 2002).
Theorem 1 (Sandwich Theorem) Let C be a concept class.
This theorem has been discovered independently several times and has several proofs (see (Moran, 2012) for more details). One approach, which is also used in proving the Sauer-Shelah Lemma (Sauer, 1972; Shelah, 1972), is via down-shifting. We now sketch this approach. In a down-shifting step we pick a dimension ), and every
replaced by its x-neighbor (i.e. the concept
which disagrees with c only on x) if the following conditions hold: (i) c(x) = 1, and (ii) the x-neighbour of c does not belong to C. One can easily verify that if
is obtained from C by a down-shifting step, then
). Eventually, after enough down-shifting steps have been performed
the resulting class becomes downward-closed (see Example 2 below). For such classes the cardinality of s(C), st(C), and |C| are all equal. This implies the inequalities in the Sandwich Theorem for the original class.
The inequalities in this theorem can be strict: Let be such that C contains all boolean vectors with an even number of 1
) contains only the empty set and s(C) contains all subsets of {1, . . . , n} of size at most
1. Thus in this example,
The VC dimension (Vapnik and Chervonenkis, 1971; Blumer et al., 1989) is defined as:
Note that by the definition of the VC-dimension:
Hence, an easy consequence of Theorem 1 is that for every concept class C, we have . This is the well-known Sauer-Shelah Lemma (Sauer, 1972; Shelah, 1972).
2.2 Definition of extremal classes and examples
Maximum classes are defined as concept classes which satisfy the Sauer-Shelah inequality with equality. Analogously, extremal classes are defined as concept classes which satisfy the inequalitiesin the Sandwich Theorem with equality: A concept class C is extremal if for every shattered set S of C there is a cube of C with dimension set S, i.e. s(C) = st(C). Note that complementing the bits in a column of the table representing C does not affect the sets s(C), st(C) and extremality is preserved. Also in the one inclusion graph, only the labels of the vertices are affected by such column complementations.
Every maximum class is an extremal class. Moreover, maximum classes of VC dimension d are precisely the extremal classes for which the shattered sets consist of all subsets of the domain of size up to d. The other direction does not hold - there are extremal classes that are not maximum. All the following examples are extremal but not maximum.
Example 1 Consider the concept class C over the domain given in Fig. 1. In this example
This example also demonstrates the cubical structure of extremal classes.
This is simply the hamming ball of radius d around the all 0’s concept. A natural generalization of such classes are downward closed classes. We say that C is downward closed if for all and for all
means that for every
. It is not hard to verify that every downward closed class is extremal.
Example 3 (Hyper-planes arrangements in a convex domain)
Another standard set of examples for maximum classes comes from geometry (see e.g. (Gart- ner and Welzl, 1994)). Let H be an arrangement of hyperplanes in . For each hyperplane
, pick one of half-planes determined by
to be its positive side and the other its negative side. The hyperplanes of
into open regions (cells). Each cell defines a binary mapping with domain H:
It is known that if the hyperplanes are in general position, then the set C of all cells is a maximum class of VC dimension d.
Consider the following generalization of these classes: Let be a convex set. Instead of taking the vectors corresponding to all of the cells, take only those that correspond to cells that intersect K:
is extremal. In fact, for
to be extremal it is not even required that the hyperplanes are in general position. It suffices to require that no d + 1 hyperplanes have a non-empty intersection (e.g. parallel hyperplanes are allowed). Fig. 2 illustrates such a class
the plane. These classes were studied in (Moran, 2012).
Interestingly, extremal classes also arise in the context of graph theory:
Example 4 (Edge-orientations which preserve connectivity (Kozma and Moran, 2013))
Let G = (V, E) be an undirected simple graph and let be a fixed reference orientation. Now an arbitrary orientation of E is a function
oriented as in
is oriented opposite to
fixed vertices, and consider all orientations of E for which there exists a directed path from s to t. The corresponding class of orientations
is an extremal concept class over the domain E.
Figure 2: An extremal class that correspond to the cells of a hyperplane arrangement of a convex set. An arrangement of 4 lines is given which partitions the plane to 10 cells. Each cell corresponds to a binary vector which specifies its location relative to the lines. For example the cell corresponding to 1010 is on the positive sides of lines 1 and 3 and on the negative side of lines 3 and 4. Here the convex set K is an ellipse and the extremal concept class consisting of the cells the ellipse intersects is (the cells 0100, 0111 are not intersected by the ellipse). The class
here has VC dimension 2. Note that it’s shattered sets of size 2 are exactly the pairs of lines whose intersection point lies in the ellipse K.
Moreover, the extremality of this class yields the following result in graph theory: The number of orientations for which there exists a directed path from s to t equals the number of subgraphs for which there exists an undirected path from s to t. For a more thorough discussion and other examples of extremal classes related to graph orientations see (Kozma and Moran, 2013).
Example 5 (A general construction of a maximal class that is extremal but not maximum)
Take a k-dimensional cube and glue to each of its vertices an edge of a new distinct dimension. The resulting class has 2concepts and
dimensions. Let C be the complement of that class.
Claim 1 C is an extremal maximal class of VC dimension which is not maximum.
We prove this claim in Appendix A. Note that and maximum classes of VCdim
dimensions have size 2
the maximum classes of VCdim
larger than the constructed extremal maximal class of VCdim
2.3 Basic properties of extremal classes
Extremal classes have a rich combinatorial structure (See (Moran, 2012) and references within for more details). We discuss some of parts which are relevant to compression schemes.
The following theorem provides alternative characterizations of extremal classes:
Theorem 2 (Bollob´as and Radcliffe (1995); Bandelt et al. (2006)) The following statements are equivalent:
1. C is extremal, i.e. s(C) = st(C).
2. |s(C)| = | st(C)|.
3. |st(C)| = |C|.
4. |C| = |s(C)|.
5. is extremal.
The following theorem shows that the property of “being an extremal class” is preserved under standard operations. It was also proven independently by several authors (e.g. (Bol- lob´as and Radcliffe, 1995; Bandelt et al., 2006)).
Theorem 3 Let C be any extremal class, be any cube such that
are extremal concept classes over the domain
is an extremal concept class over the domain dom(C).
Note that if C is maximum then are also maximum, but
necessarily maximum. This is an example of the advantage extremal classes have over the more restricted notion of maximum classes.
Interestingly, the fact that extremal classes are preserved under intersecting with cubes yields a rather simple proof (communicated to us by Ami Litman) of the fact that every extremal class is “distance preserving”. This property also holds for maximum classes (Gart- ner and Welzl, 1994), however the proof for extremal classes is much simpler than the previous proof for maximum classes (given in (Gartner and Welzl, 1994)):
Theorem 4 (Greco (1998)) Let C be any extremal class. Then for every distance between
in the one-inclusion graph of C equals the hamming distance between
Proof Assume towards contradiction that this is not the case. Among all possible pairs of for which there is no such path, pick a pair
of a minimal hamming distance. Let B be the minimal cube over the domain dom(C) which contains both
. So the dimensions set of
We first claim that by the minimality criteria according to which were chosen, there cannot be any other concept in
. Without loss of generality assume that for all
of entire columns without affecting the distances between concepts). If there was now another concept
) must have at least one 0 and at least one 1. By the minimality according to which
were chosen – there must exist a path between
c of length equal the number of 1’s in c| dim(B). Similarly, there must exist a path between
of length equal the number of 0’s. The combined path would be the length of the Hamming distance between
. So by the minimality
does not contain another concept. Therefore
and this completes the proof of the claim.
Next we observe that by Theorem 3, must be extremal. However we claim that
is not extremal: Since there is no edge between its two concepts
st(
and therefore | st(C)| = 1 < 2 = |C|. This means that
extremal which is a contradition.
The following lemma brings out the special cubical structure of extremal classes. We will use it to prove the correctness of the compression scheme given in the following section. It shows that if are two maximal cubes of an extremal class C then their dimensions sets dim(
) are incomparable.
are two cubes of an extremal class
is maximal, then
Proof Assume towards contradiction that dim(cube
contains the single concept tag(
, and the cube
is a cube of
dimensions set dim(is extremal,
must be connected (by Theorem 4). Therefore there is a path in
between the concept tag(
) and some concept in the cube
. This means there is some edge e incident to the concept
. This edge e is a one-dimensional cube of
labeled with some dimension x. This cube with dimension set dim(e) = {x} expands to a cube of C with dimension set dim(
which contains the cube
. This contradicts the maximality of cube
One concise way to represent an extremal class is as the union of its maximal cubes. With this representation, the extremal class of Fig. 3 is described by the expression
where “+” stands for union. Note that the dimension sets of the cubes are marked as *’s and for the class to be extremal, the dimension sets must be incomparable.
Let C be a concept class. On a high level, a sample compression scheme for C compresses every sample of C to a subsample of size at most k and this subsample represents a hypothesis on the entire domain of C that must be consistent with the original sample. More formally, a labeled compression scheme of size k for C consists of a compression map and a reconstruction map
. The domain of the compression map consists of all samples from concepts in C: For each sample
compresses it to a subsample
of size at most k. The domain of the reconstruction function
is the set of all samples of C of size at most k. Each such sample is used by
to reconstruct a concept h with dom(h) = dom(C). The
sample compression scheme must satisfy that for all samples s of C,
The sample compression scheme is said to be proper if the reconstructed hypothesis h always belongs to the original concept class C.
A proper labeled compression scheme for extremal classes of size at most the VC dimension is given in Algorithm 1. Let C be an extremal concept class and s be a sample of C. In the compression phase the algorithm finds any maximal cube B of C|dom(s) that contains the sample s and compresses s to the subsample determined by the dimensions set of that maximal cube. Note that the size of the dimension set (and the compression scheme) is bounded by the VC dimension.
How should we reconstruct? Consider all concepts of C that are consistent with the sample s:
Correctness means that we need to reconstruct to one of those concepts. Let input for the reconstruction function and let
). During the reconstruction, the domain dom(s) of the original sample s is not known. All that is known at this point is that D is the dimensions set of a maximal cube B of C|dom(s) that contained the sample s. The reconstruction map of the algorithm outputs a concept in the following set:
For the correctness of the compression scheme it suffices to show that for all choices of the maximal cube is non-empty and a subset of
. The following Lemma guarantees the non-emptiness.
Lemma 6 Let C be an extremal class and let be the dimensions set of some cube of C|dom(s). Then D is also the dimensions set of some cube of C.
Proof Clearly the dimension set D is shattered by C|dom(s) and therefore it is also shattered by C. By the extremality of C, D is also strongly shattered by it, and thus there exists a cube B of C with dimensions set D.
The second lemma show that for each choice of the maximal cube
Lemma 7 Let s be a sample of an extremal class C, let B be any maximal cube of C|dom(s) that contains s, and let D denote the dimensions set of B. Then for any cube dim(
, the concept
that is consistent with s on D is also consistent with s on dom(s) \ D.
Figure 3: The one-inclusion graph of an extremal concept class C is given on the left. Consider the sample . There are 4 concepts
consistent with this sample (the octagonal vertices), i.e.
There are 2 maximal cubes of C|dom(s) (graph on right) that contain the sample s (in grey) with dimension sets
, respectively. Let B be the maximal cube with dimension set
. There are 3 cubes of C (on left) with the same dimension set D. Each contains a concept h (shaded grey) that is consistent with the original sample on
and therefore
. For the correctness we need that
nodes on left) is non-empty and a subset of
(octagon nodes on left). Note that in this case
is a strict subset. 14
Proof Since B is a cube with dimensions set D, B|(dom(s)\D) contains the single concept tag(B).
Let be any cube of
be the concept in
consistent with s on D. Now consider the cube
). We will show that
B. This will finish the proof as it shows that both
which means that tag(B) = h|(dom(s) \ D) = s|(dom(s) \ D). Moreover, by the definition of h, h|D = s|D, and therefore h|dom(s) = s as required.
We now show that . Indeed, since
is a cube of C with dimension set
), the cube
) is a cube of C|dom(s) with the same dimension set D. Thus the dimension set of
) contains the dimension set of the maximal cube B of C|dom(s). Therefore, since C|dom(s) is extremal (Theorem 3) it follows by Lemma 5 that
The labeled compression scheme of the previous section compresses each sample of the concept class to a (labeled) subsample and this subsample is guaranteed to represent a hypothesis that is consistent with the entire original sample. Such a labeled compression scheme (of size equal the VC dimension d) was first found for maximum classes. In the previous section, we generalized this scheme to extremal classes.
Alternate “unlabeled” compression schemes have also been found for maximum classes and a natural question is whether these schemes again generalize to extremal classes. As we shall see there is an excellent match between the combinatorics of unlabeled compression schemes and extremal classes. The existence of such schemes remains open at this point. We can however relate their existence to some natural conjectures about extremal classes.
An unlabeled compression schemes compresses a sample s of the concept class C to an (unlabeled) subset of the domain of the sample s. In other words, in an unlabeled compression scheme the labels of the original sample are not used by the reconstruction map. The size of the compression scheme is now the maximum size of the subset that the sample is compressed to. Consider an unlabeled compression scheme for C of size V Cdim(C). For a moment restrict your attention to samples of C over some fixed domain Each such sample is a concept in the restriction C|S. Note that two different concepts in C|S must be compressed to different subsets of S, otherwise if they were compressed to the same subset, the reconstruction of it would not be consistent with one of them. For maximum classes, the number of concepts in C|S is exactly the number of subsets of S of size up to the VC dimension. Intuitively, this “tightness” makes unlabeled compression schemes combinatorially rich and interesting.
Figure 4: Unlabeled compression scheme based on peeling. The vertices of the extremal class C from the left figure of Fig. 3 were peeled in a top down order (See table on left). Note that the highest vertex is always a corner of the remaining extremal class below. The resulting representation sets are underlined in the left table. For example the second concept is represented by the set last to the empty set.
Now consider the domain and a sample
0 over this domain. Partition C into equivalence classes such that concepts in the same class are consistent on D. In the table on the right, we reordered and segmented the concepts of C by their equivalence classes. Each class corresponds to a member of C|D. In Lemma 9 we show that each class contains exactly one concept c such that
(marked with
). Each sample s of C|D is compressed to the unique subset of D in the equivalence class that represents this consistent concept (marked with
). In the reconstruction, each representation set is reconstructed to the concept it represents. In particular the sample s associated with the first class is compressed to
which represents the consistent concept 111110 and “unlabeled sub sample”
is reconstructed to this concept.
Previous unlabeled compression schemes for maximum classes were based on “representation maps”. For maximum classes these are one-to-one mappings between C and subsets of dom(C) of size at most V Cdim(C). Representation maps were used in the following way: Each sample s is compressed to a subset of dom(s) which represents a consistent hypothesis with s, and each subset of size at most V Cdim(C) of dom(C) is reconstructed to the hypothesis it represents. Clearly, not every one-to-one mapping between C and subsets of dom(C) of size at most V Cdim(C) yields an unlabeled compression scheme in this manner, and finding a good representation map (or proving that one exists) became the focus of many previous works.
For maximum classes, representation maps r have been found that map (one-to-one) the concept class C to all subsets of dom(C) of size up to the VC dimension of C. The key combinatorial property that enabled finding representation maps for maximum classes was a “non clashing” condition (Kuzmin and Warmuth, 2007). This property was used to show that for any sample s of C there is exactly one concept c that is consistent with s and ). This immediately implies an unlabeled compression scheme based on non clashing representation maps: Compress to the unique subset of the domain of the sample that represents a concept consistent with the given sample.
We will show below that representation maps naturally generalize to extremal classes: r must now map (one-to-one) the extremal class C to its shattered sets s(C). This is natural since for extremal classes |C| = |s(C)|. We will see that again, if the non clashing condition holds, then for any sample s of C there is exactly one concept c that is consistent with
For maximum classes, such representation maps were first shown to exist via a recursive construction (Kuzmin and Warmuth, 2007). Alternate representation maps were also proposed in (Kuzmin and Warmuth, 2007) based on a certain greedy “peeling” algorithm that iteratively assigns a representation to a concept and removes this concept from the class. The correctness of the representation maps based on peeling was finally established in (Ru- binstein and Rubinstein, 2012). In this section, we show that existence of representation maps based on peeling hinges on certain natural and concise properties of extremal classes. However establishing these conjectured properties of extremal classes remains open.
Representation maps. For any concept class C a representation map is any one-to-one mapping from concepts to subsets of the domain, i.e. )). We say that
). Furthermore we say that two different concepts
with respect to r if they are consistent with each other on the union of their representation sets, i.e.
)). If no two concepts clash then we say that r is non clashing.
Example 6 (Non clashing maps based on disagreements) For an arbitrary concept class be the set of all dimensions on which
i.e.
“reference” concept and define a representation map for class
leave it to the reader to verify that r is non clashing.
Example 7 (A Non clashing representation map for distance preserving classes)
in the one-inclusion graph of C equals to their hamming distance. For every
The representation map has the property that for every
disagree on r(c). To see this, note that since C is isometric then any shortest path from
traverses exactly the dimensions on which
disagrees. In particular, the first edge leaving c in this path traverses a dimension
definition of
we have that
and indeed
disagree on
In fact, this gives a stronger property for distance preserving classes, which is summarized in the following lemma. This lemma will be useful in our analysis.
Lemma 8 Let C be a distance preserving class and let is a teaching set for c with respect to C. That is, for all
Clearly the representation map ) is non clashing. The following lemma establishes that certain non clashing representation maps immediately give unlabeled compression schemes:
Lemma 9 Let r be any representation map that is a bijection between an extremal class C and st(C). Then the following two statements are equivalent:
1. r is non clashing.
2. For every sample s of C, there is exactly one concept that is consistent with s and
Based on this lemma it is easy to see that a representation mapping r for an extremal concept class C defines a compression scheme as follows (See Algorithm 2 and an example in Fig. 4). For any sample s of C we compress s to the unique representative r(c) such that c is consistent with ). Reconstruction is even simpler, since r is bijective: If s is compressed to the set r(c), then we reconstruct r(c) to the concept c.
Note that the representation set r(c) of a concept c is always an unlabeled set from st(C). However, we could also compress to the labeled subsamples c|r(c). It is just that the labels in this type of scheme do not have any additional information and are redundant.
Proof [of Lemma 9]
2 Proof by contrapositive. Assume
. Clearly both
are consistent with
. This negates 2.
1 We will show that 1 implies the following equivalent form of 2: For all sample domains
and samples
, there is exactly one concept
that is consistent with
. Recall that any domain
partitions C into equivalence classes where each class contains all concepts of C consistent with a sample from C|D. We need to show that each equivalence class has a unique concept in
. See Fig. 4 for an example. We split our goal into two parts:
We first prove Part (b). Clearly Furthermore, the non-clashing condition (Part 1 of the lemma) implies that any distinct concepts
disagree on
and therefore |R|D| = |R|.
Since , the set equality R|D = C|D of Part (a) is implied by the fact that both sets have the same cardinality:
For a more detailed proof Assume 2, i.e. there is a sample y of C with dom(y) = Y for which there are either zero or (at least) two consistent concepts
concepts
are consistent with
(which is
1). Assume now that there is no concept c consistent with some sample y of C for which
. Note that
In other words the number of samples consistent with y equals the number of concepts with a representation set in Y . Partition C into equivalence classes where two concepts equivalent if
(See Fig. 4 for a running example). Thus, each equivalence class corresponds to a sample of C with domain Y . Each concept is identified by its representation set r(c) and the number of equivalence classes equals
assumption, all concepts c in the equivalence class of sample
. Therefore by a pigeon hole argument there must be an equivalence class with two distinct concepts
These two concept clash and again
1 is implied.
Once we have a valid representation mapping for some extremal concept class C, we can easily derive a valid mapping for any restriction of the class C|A by compressing every restricted concept. This is discussed in the following corollary.
Corollary 10 For any extremal class is a representation mapping for C then a representation mapping for C|A can be constructed as follows. For any
be the representative of the unique concept
, such that
and
Proof The construction of the mapping for C|A essentially tells us to treat the concept c as a sample from C and to compress it. Thus we can apply Lemma 9 to see that is always uniquely defined. Now we need to show that
satisfies the conditions of the Main Definition. Since the representatives
) are subsets of A, the non-clashing property for the representation mapping
follows from the non-clashing condition for r for C. The bijection property follows from a counting argument like the one used in the proof of Lemma 9, since
Corner peeling yields good representation maps. We now present a natural conjecture concerning extremal classes and show how this conjecture can be used to construct non clashing representation maps. A concept c of an extremal class C is a corner of C if C \ {c} is extremal. By Lemma 5 we have that for each ) there is at most one maximal cube with dimension set S and if S is the dimensions set of a non-maximal cube, then there are at least two cubes with this dimension set. Therefore
For C \ {c} to be extremal, 1 (by Theorem 2) and therefore c is a corner of an extremal class C iff c lies in exactly one maximal cube of C.
Conjecture 11 Every non empty extremal class C has at least one corner.
In (Kuzmin and Warmuth, 2007) essentially the same conjecture was presented for maximum classes. For these latter classes, the conjecture was finally proved in (Rubinstein and Rubinstein, 2012). This conjecture also has been proven for other special cases such as extremal classes of VC dimension at most 2 (Litman and Moran, 2012; M´esz´aros and R´onyai, 2014). In fact Litman and Moran (2012) proved a stronger statement: For every two extremal classes 2, there exists an extremal class
is a strict subset of
and a strict superset of
). Indeed, this statement is stronger as by repeatedly picking a larger extremal class
eventually a
is obtained such that
is extremal. For general extremal classes this stronger statement also remains open.
Conjecture 12 For every two extremal classes there exists an extremal class
Let us return to the more basic Conjecture 11. How does this conjecture yield a representation map? Define an order
such that for every is a corner of
, and define a map
such that
is the unique maximal cube of
belongs to. We claim that r is a representation map. Indeed, r is a one-to-one mapping from C to st(C) (and since C is extremal r is a bijection). To see that r is non clashing, note that
is extremal and therefore distance preserving (Theorem 4). Thus, Lemma 8 implies that
) is a teaching set of
with respect to
. This implies that r is indeed non clashing.
We studied the conjecture of Floyd and Warmuth (1995) which asserts that every concept classes has a sample compression scheme of size linear in its VC dimension. We extended the family of concept classes for which the conjecture is known to hold by showing that every extremal class has a sample compression scheme of size equal to its VC dimension. We discussed the fact that extremal classes form a natural and rich generalization of maximum classes for which the conjecture had been proved before (Floyd and Warmuth, 1995).
We further related basic conjectures concerning the combinatorial structure of extremal classes with the existence of optimal unlabeled compression schemes. These connections may also be used in the future to provide a better understanding on the combinatorial structure of extremal classes, which is considered to be incomplete by several authors (Bol- lob´as and Radcliffe, 1995; Greco, 1998; R´onyai and M´esz´aros, 2011).
Our compression schemes for extremal classes yield another direction of attacking the general conjecture of Floyd and Warmuth: it is enough to show that an arbitrary maximal concept class of VC dimension d can be covered by exp(d) extremal classes of VC dimension O(d). Note it takes additional O(d) bits to specify which of the exp(d) extremal classes is used in the compression.
Acknowledgements We thank Micha�l Derezi´nski for a good feedback on the writing of the paper and Ami Litman for helpful combinatorial insights.
R.P. Anstee, L. R´onyai, and A. Sali. Shattering news. Graphs and Combinatorics, 18(1): 59–73, 2002.
H.J. Bandelt, V. Chepoi, A.W.M. Dress, and J.H. Koolen. Combinatorics of lopsided sets. Eur. J. Comb., 27(5):669–689, 2006.
S. Ben-David and A. Litman. Combinatorial variability of Vapnik-Chervonenkis classes with applications to sample compression schemes. Discrete Applied Mathematics, 86(1): 3–25, 1998. doi: 10.1016/S0166-218X(98)00000-6. URL http://dx.doi.org/10.1016/ S0166-218X(98)00000-6.
A. Blumer, A.j Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Inf. Process. Lett., 24(6):377–380, 1987. doi: 10.1016/0020-0190(87)90114-1. URL http: //dx.doi.org/10.1016/0020-0190(87)90114-1.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik- Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. ISSN 0004-5411. doi: 10.1145/76359.76371.
B. Bollob´as and A. J. Radcliffe. Defect Sauer results. J. Comb. Theory, Ser. A, 72(2): 189–208, 1995.
B. Bollob´as, A. J. Radcliffe, and Leader I. Reverse Kleitman inequalities. Proc. London Math. Soc., Ser. A, (3) 58:153–168, 1989.
A. Chernikov and P. Simon. Externally definable sets and dependent pairs. Israel Journal of Mathematics, 194(1):409–425, 2013.
A. Daniely and S. Shalev-Shwartz. Optimal learners for multiclass problems. In COLT, pages 287–316, 2014. URL http://jmlr.org/proceedings/papers/v35/daniely14b. html.
T. Doliwa, H.-U. Simon, and S. Zilles. Recursive teaching dimension, learning complexity, and maximum classes. In ALT, pages 209–223, 2010. doi: 10.1007/978-3-642-16108-7 19. URL http://dx.doi.org/10.1007/978-3-642-16108-7_19.
A.W.M. Dress. Towards a theory of holistic clustering. DIMACS Ser. Discrete Math. Theoret. Comput. Sci., 37 Amer. Math. Soc.:271–289, 1997.
S. Floyd. Space-bounded learning and the Vapnik-Chervonenkis dimension. In COLT, pages 349–364, 1989. ISBN 1-55860-086-8. URL http://dl.acm.org/citation.cfm? id=93379.
S. Floyd and M. K. Warmuth. Sample compression, learnability, and the VapnikChervonenkis dimension. Machine Learning, 21(3):269–304, 1995. doi: 10.1007/ BF00993593. URL http://dx.doi.org/10.1007/BF00993593.
Y. Freund. Boosting a weak learning algorithm by majority. Inf. Comput., 121(2):256–285, 1995. doi: 10.1006/inco.1995.1136. URL http://dx.doi.org/10.1006/inco.1995. 1136.
Y. Freund and R. E. Schapire. Boosting: Foundations and Algorithms. Adaptive computation and machine learning. MIT Press, 2012. ISBN 9780262017183. URL http://books.google.co.il/books?id=blSReLACtToC.
B. Gartner and E. Welzl. Vapnik-Chervonenkis dimension and (pseudo-)hyperplane ar- rangements. Discrete and Computational Geometry (DCG), 12:399–432, 1994.
G. Greco. Embeddings and the trace of finite sets. Inf. Process. Lett., 67(4):199–203, 1998.
D. Haussler, N. Littlestone, and M.K. Warmuth. Predicting \0,1\-functions on randomly drawn points. Inf. Comput., 115(2):248–292, 1994. doi: 10.1006/inco.1994.1097. URL http://dx.doi.org/10.1006/inco.1994.1097.
D. P. Helmbold, R. H. Sloan, and M. K. Warmuth. Learning integer lattices. SIAM J. Comput., 21(2):240–266, 1992. doi: 10.1137/0221019. URL http://dx.doi.org/10. 1137/0221019.
L. Kozma and S. Moran. Shattering, graph orientations, and connectivity. Electr. J. Comb., 20(3):P44, 2013. URL http://www.combinatorics.org/ojs/index.php/eljc/ article/view/v20i3p44.
D. Kuzmin and M. K. Warmuth. Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research, 8:2047–2081, 2007. URL http://dl.acm.org/ citation.cfm?id=1314566.
J. Lawrence. Lopsided sets and orthant-intersection by convex sets. Pac. J. Math., 104(1): 155–173, 1983.
A. Litman and S. Moran. Unpublished results. 2012.
N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished, 1986.
R. Livni and P. Simon. Honest compressions and their application to compression schemes. In COLT, pages 77–92, 2013. URL http://jmlr.org/proceedings/papers/ v30/Livni13.html.
T. M´esz´aros and L. R´onyai. Shattering-extremal set systems of VC dimension at most 2. Electr. J. Comb., 21(4):P4.30, 2014. URL http://www.combinatorics.org/ojs/ index.php/eljc/article/view/v21i4p30.
S. Moran. Shattering-extremal systems. CoRR, abs/1211.2980, 2012. URL http://arxiv. org/abs/1211.2980.
S. Moran, A. Shpilka, A. Wigderson, and A. Yehudayoff. Teaching and compressing for low VC-dimension. ECCC, TR15-025, 2015.
Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. J. ACM, 63(3):21:1–21:10, June 2016. ISSN 0004-5411. doi: 10.1145/2890490. URL http://doi. acm.org/10.1145/2890490.
A. Pajor. Sous-espaces des espaces de banach. Travaux en Cours. Hermann, Paris, 1985.
L. R´onyai and T. M´esz´aros. Some combinatorial applications of Gr¨obner bases. In CAI, pages 65–83, 2011.
B. I. P. Rubinstein and J. H. Rubinstein. A geometric approach to sample compression. Journal of Machine Learning Research, 13:1221–1261, 2012. URL http://dl.acm.org/ citation.cfm?id=2343686.
B. I. P. Rubinstein, P. L. Bartlett, and J. H. Rubinstein. Shifting: One-inclusion mistake bounds and sample compression. J. Comput. Syst. Sci., 75(1):37–59, 2009. doi: 10.1016/ j.jcss.2008.07.005. URL http://dx.doi.org/10.1016/j.jcss.2008.07.005.
R. Samei, B. Yang, and S. Zilles. Generalizing labeled and unlabeled sample compression to multi-label concept classes. In ALT, pages 275–290, 2014. ISBN 978-3-319-11661-7. doi: 10.1007/978-3-319-11662-4 20. URL http://dx.doi.org/10.1007/ 978-3-319-11662-4_20.
N. Sauer. On the density of families of sets. J. Comb. Theory, Ser. A, 13:145–147, 1972.
S. Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pac. J. Math., 41:247–261, 1972.
L.G. Valiant. A theory of the learnable. Commun. ACM, 27:1134–1142, 1984. ISSN 0001-0782. doi: 10.1145/1968.1972.
V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probability. Theory Probab. Appl., 16:264–280, 1971.
M. K. Warmuth. Compressing to VC dimension many points. In COLT/Kernel, pages 743– 744, 2003. doi: 10.1007/978-3-540-45167-9 60. URL http://dx.doi.org/10.1007/ 978-3-540-45167-9_60.
To prove this claim use the following simple fact.
Let
be two complementing concept classes over domain X. Then for every
exactly one of the following holds.
1. C strongly shatters Y .
2. ¯
With this lemma at hand, note that if C is extremal then for every
Going back to our C from the construction, it is easy to verify that ¯C (and therefore C) is extremal, because glueing an edge of a new dimension to a concept of an extremal class preserves extremality. Thus, by the above lemma 2 (because every subset of size 1 is strongly shattered by ¯C but there are subsets of size 2 that are not strongly shattered by ¯C). To see why C is maximal we again use the above lemma and observe that every concept ¯c which is removed from ¯C removes a set of size 1 (the set containing the unique dimension of the edge glued to ¯c) from the strongly shattered sets of ¯C. This means that a set of size
1 is added to the shattered sets of C and the VC dimension of C is increased from