Nested Barycentric Coordinate System as an Explicit Feature Map

2020·Arxiv

Abstract

Abstract

We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices’ barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.

1 Introduction

Kernel methods provide two principal benefits: (1) They implicitly induce a non-linear feature map, which allows for a richer space of classifiers and (2) when the kernel trick is available, they effectively replace the dimension d of the feature space with the sample size n as the computational complexity parameter. As such, these are well-suited for the ‘high dimension, moderate data size’ regime. For very large datasets, however, naive use of kernel methods becomes prohibitive. The cost is incurred both at the training stage, where an optimal classifier is searched for over an n-dimensional space, and at the hypothesis evaluation stage, where a sum of n kernel evaluations must be computed.

For these reasons, for large data sets, explicit feature maps are sometimes preferred. Various approximations have been proposed to mitigate the computational challenges associated with explicit feature maps, including Chang et al. (2010); Maji et al. (2012); Perronnin et al. (2010); Rahimi and Recht (2007); Vedaldi and Zisserman (2012); Li et al. (2010); ?); ?); ?.

Our contribution. We propose a new embedding method which is well-suited for the large sample regime. Our technique consists of partitioning the space into a nested hierarchy of simplices, and then embedding each data point into features corresponding to the barycentric coordinates of the simplex that contains it. We then train a linear classifier in the rich feature space obtained from the simplices. For sample size n in d-dimensional space, our algorithm has runtime regardless of the dimension of the embedding space (when the approximation parameter is taken to be fixed, see Sections 4 and 5). In contrast, standard kernelized SVM has a runtime .

Additionally, our embedding technique allows for highly non-linear decision boundaries, although these are linear within each simplex (and hence piecewise-linear overall), as explained in Section 2. At the same time, our approach is sufficiently robust to closely approximate realizable convex bodies – in fact, multiple such bodies – in only linear time in fixed dimension (Section 3). We also give generalization bounds based on empirical margin (Theorem 3) and a novel hybrid sample compression technique (Theorem 4). Finally, we perform an extensive empirical evaluation, in which our method consistently outperforms other explicit feature map classification methods, including a range of popular kernel embedding methods (Section 5).

1.1 Related Work

Kernel approximations for explicit feature maps come in two basic varieties: data-independent approximations to fixed kernels, and data-dependent feature maps.

Data-dependent kernel approximations. This category includes Nystrom’s approximation (Williams and Seeger, 2000), which projects the data onto a suitably selected subspace. If is the projection of example x onto the basis element , the points are chosen to maximally capture the data variability. Some methods select from the sample. The selection can be random (Williams and Seeger, 2001), greedy (Smola and Schökopf, 2000), or involve an incomplete Cholesky decomposition (Fine and Scheinberg, 2001). Perronnin et al. (2010) applied Nystrom’s approximation to each dimension of the data independently, greatly increasing the efficiency of the method.

Data-independent kernel approximations. This category includes sampling the Fourier domain to compute explicit maps for translation invariant kernels. Rahimi and Recht (2007, 2009) do this for the radial basis function kernel, also known as Random Kitchen Sinks. Li et al. (2010); Vedaldi and Zisserman (2012) applied this technique to certain group-invariant kernels, and proposed an adaptive approximation to the kernel. Porikli and Ozkan (2011) map the input data onto a low-dimensional spectral (Fourier) feature space via a cosine transform. Vempati et al. (2010) proposed a skewed chi squared kernel, which allows for a simple Monte Carlo approximation of the feature map. Maji et al. (2012) approximated the intersection kernel and the kernel by a sparse closed-form feature map. Pele et al. (2013) suggested using not only piecewise linear function in each feature separately but also to add all pairs of features. Chang et al. (2010) conducted an extensive study on the usage of the second-order polynomial explicit feature map. Bernal et al. (2012) approximated second order features relationships via a Conditional Random Field model.

Decompositions and other SVM approaches. Simplex decompositions have been used to produce proximity-based classifiers (Belkin et al., 2018; Davies, 1996), but to the best of our knowledge, ours is the first work to utilize either nested simplex decompositions or barycentric centers in conjunction with SVM. Simplex decompositions are related to the quadtree, and the quadtree has been used together with SVM for various learning tasks (Saavedra et al., 2004; Beltrami and da Silva, 2015), but not for the creation of a kernel embeddings. Simplex decompositions are more efficient than quadtrees, since a simplex naturally decomposes into only d+1 sub-simplices (Section 2), while a quadtree cell naturally decomposes into sub-cells.

As mentioned, our emphasis in this paper is specifically on explicit feature maps, but there are numerous approaches to reducing kernel SVM runtime (for example the CoreSVM of Tsang et al. (2005, 2007)). Another related paradigm is that of Local SVM (Hao Zhang et al., 2006; Gu and Han, 2013), which assumes continuity of the labels with respect to spacial proximity; similarly labeled points tend to cluster together. This differs from the underlying assumption motivating kernel SVM, which assumes that the data is approximately linearly separable, but not necessarily clusterable. These approaches find success in distinct settings, and are incomparable.

Approximating convex polytopes. Learning arbitrary convex bodies requires very large sample size (Goyal and Rademacher, 2009), and so we focus instead on convex polytopes defined by a small number of halfspaces. However, the problem of finding consistent polytopes is known to be NP-complete even when the polytope is simply the intersection of two hyperplanes (Megiddo, 1988). In fact, Khot and Saket (2011) showed that “unless NP = RP, it is hard to (even) weakly PAC-learn intersection of two halfspaces”, even when allowed the richer class of O(1) intersecting halfspaces. Klivans and Sherstov (2009) showed that learning an intersection of halfspaces is intractable regardless of hypothesis representation (under certain cryptographic assumptions). These negative results have motivated researchers to consider the problem of discovering consistent polytopes which have some separating margin. Several approximation and learning algorithms have been suggested for this problem, featuring bounds with steep dependence on the inverse margin and number of halfspaces forming the polytope (Arriaga and Vempala, 2006; Klivans and Servedio, 2008; Gottlieb et al., 2018; Goel and Klivans, 2018).

In contrast, we show in Section 3 that our method is capable of approximating any convex polytope in linear time (in fixed dimension), independent of the halfspace number and with only logarithmic dependence on the inverse margin. It accomplishes this by finding a linear separator in the higher-dimensional embedded space, and projecting the solution back into the origin space. However, our approach is not strictly comparable to those above, as they are concerned with minimizing the disagreement between the computed polytope (or object) and the true underlying polytope with respect to the point space, while we minimize the volume of the space between them.

2 The barycentric coordinate system

Here we describe the nested barycentric coordinate system embedding. We explain its construction and description, how to embed a point from the origin space into the new coordinate system, and how a point in the embedded system can be projected back into the origin space (Section 2.1). We then show that if we associate a weight with each simplex point, then the embedding and weights together imply some (not necessarily convex) polytope on the origin space (Section 2.2). Later in Section 3, we will show that this system is sufficiently robust that it can be used to approximate any convex body.

2.1 Nested barycentric embedding

Let be a regular simplex of unit side-length, and let vertices. Each point x inside the simplex can be written using the barycentric

coefficients:

Here denotes the coefficient of point corresponding to vertex . Let the ordered vector of ’s, , corresponding to be denoted as . If we artificially augment the original feature space by adding another feature which equals to 1, i.e. and define the matrix , then the transformation is a linear transformation of the form .

We can further refine the system by introducing a new point inside the simplex, thereby splitting the simplex into d + 1 new sub-simplices. We order the coordinates of our system as . A point inside the system is embedded by first utilizing the d + 1 vertices of its surrounding simplex to compute the barycentric coefficients (the ’s) of equation 1. Then is assigned a vector wherein a coordinate corresponding to one of these d + 1 simplex vertices is set to the coefficient of that vertex, and all other coordinates are set equal to 0. This defines the embedding .

The refinement process can be continued by choosing points inside simplices to further split these simplices. We define the nested architecture and its associated embedding to be the coordinates constructed from by concatenating a new point to the previous coordinate system. Each point is embedded using the barycentric coefficients of the vertices of the simplex surrounding point x and by assigning those coefficients in the index of the corresponding vertices and by assigning zero to all other vertices. We note that the embedding — the nested barycentric coordinate system — is sparse, as only d + 1 coefficients are non-zero, and also that the embedded points lie on the sphere ().

A point in the embedded space can be projected back into the origin space by utilizing the identity

2.2 Weights, hyperplanes and polytopes

Given an embedding, we will assign a set of weights to the vertices Then the set of points R such that:

is a union of regions whose boundaries are unions and intersections of hyperplanes. R can represent a polytope as well as the union of several disjoint

Fig. 1: Creation of the nested system about a convex polytope

polytopes, each of which is not necessarily convex (see Figure 2 for an illustration):

Lemma 1 Any hyperplane that crosses a single simplex can be defined by a set of weights , such that all points that lie on the hyperplane satisfy the equation . Further, R is a union of regions whose boundaries are unions and intersections of hyperplanes, where each simplex contains at most one hyperplane.

Proof Choose d linearly independent points on the given hyperplane. Since these points are inside the coordinate system, they have unique barycentric coefficients and thus a unique representation. Finding these weights is equivalent to solving which is a matrix of dimension rows are the embeddings of the points. This is a homogeneous linear system and so the w’s are unique up to a scaling factor. Every point on the hyperplane is a linear combination of those d linearly independent points and thus also satisfies the equation . Likewise, every set of weights represents at most one hyperplane crossing the system.

In Section 3, we will show that a simple nested barycentric system, together with a prudent choice of weights, can be used to closely approximate any given convex body. To this end, we will require a useful property of these systems — essentially, that splitting a simplex cannot decrease the expressiveness of the system. Recall that a barycentric system is defined by an ordered set of points; we will say that is a prefix of

Theorem 1 Let be a nested barycentric system, with as its coordinates, and let be a nested barycentric system such that

Fig. 2: An example of an open polytope and two disjoint polytopes as a nested barycentric system

P be a polytope described in as:

then there exist a set of weights such that P can also be described by . Further, w is a prefix of .

In order to prove the theorem we must first demonstrate the relationship between coefficients before and after a simplex split.

Definition 1 Let be a nested barycentric system, with , Let the new splitting point be . Since , it can be written as:

We define to be the coefficients of the new coordinate of step t + 1 using the coordinate system at step t.

Lemma 2 For a given data point the connection between the coefficients of step t and t + 1 is:

For the simplicity of the notation in this proof we will use instead of , and instead of .

Proof The data point at step t can be written as:

The point at step t + 1 is written as:

Combining equations 5 and 8, we derive:

We can now prove theorem 1 by induction:

Proof For a given polytope P with a set of weights at system , such that , and , we choose the set of weights for as follows: The first t weights of are the same as for ), and . Then the scaled distance of every given point represented in from the hyperplane , is the same as the scaled distance of the point represented in , and specifically the polytope P remains the same. Using Lemma 2 we have:

3 Approximating a convex body

In this section we show that the nested barycentric coordinate system (NBCS) can represent an arbitrarily close approximation to any convex body. As stated the NBCS produces a (not necessarily convex) piece-wise linear classifier. In fact, this method can approximate multiple convex bodies. For simplicity, we focus on the case of a single convex body, and demonstrate how our method approximates it. This will be done by placing split points at the barycenters of their containing simplices, where the barycenter of a simplex with vertices is given by .

In order to state our result formally, we introduce some notation: Given a point and a parameter , let be the ball of radius centered at p. Given a set , let be the set of all points of X that are at distance at least from the boundary of X. Recall that S denotes the unit simplex.

be a given convex body of diameter 1, and let be given. Then there exists a nested system , obtained by always placing split points at the barycenters of their containing simplices, and a corresponding set of weights w, such that

satisfies the following:

Proof The construction proceeds in stages i = 0, 1, . . . , s. (Below, we will take .) At stage 0 the only points present are the vertices of S. At each stage , a new split point is placed at the barycenter of each existing simplex, and the final construction is called the s-stage uniform subdivision of be the set of simplices present at stage i, and clearly Note that all simplices in have the same volume.

The weights are assigned as follows: Initially, vertices assigned weights . At each stage , each new split point is given the smallest possible weight that ensures is given by (11). Once a weight is assigned to a point, it is never changed again. In other words, for those points of that already belonged to , their weights at are the same as their weights at .

Let be a simplex with vertices and weights respectively. Let be the barycenter of . By Theorem 1, if is assigned weight remains unchanged. Hence, the weight that will be assigned to by our construction will satisfy . And therefore, at each stage, only shrinks. If at stage i a certain simplex , then at stage i + 1 the barycenter of will be assigned weight , so that the interior of lie completely outside of .

Let us denote by the region produced by this construction after stage s. (See Figure 3 for an illustration in the plane.) We will now prove that, if s is made large enough, then approximates the given convex body P arbitrarily well, as stated in the theorem.

The diameter of a compact subset of is the maximum distance between two points in the set. In particular, the diameter of a simplex is the largest distance between two vertices of the simplex.

Fig. 3: Four stages of the approximation of a given convex polygon in the plane.

Lemma 3 Let be a simplex with vertices , let c be the diameter of , and let q be the barycenter of . Then the distance between q and any vertex is at most cd/(d + 1).

Proof Fix for concreteness. Then, under the constraints for , the distance between q and is maximized by letting for all , which yields the claimed distance.

be a simplex with diameter c. Let A be the collection of the simplices obtained by a d-stage uniform subdivision of . Then there are at least (d + 1)! simplices in A with diameter at most cd/(d + 1).

Proof By Lemma 3, every simplex in A that contains at most one vertex of will have diameter at most cd/(d + 1). Each time a simplex is subdivided into d + 1 simplices by an interior point q, the new simplices share only d of their vertices with . Hence, at stage 1 of the subdivision of , there are d + 1 simplices that share only d vertices with , there are (d + 1)d simplices that share only vertices with ; and so on, until at stage d there are simplices that share only one vertex with

Recall that denotes the collection of simplices present in the i-stage uniform subdivision of S.

Lemma 5 Let k, z be integers, and set s = zkd. Then at most afraction of the simplices in have diameter larger than .

Proof By repeated application of Lemma 4. After kd stages, at most an -fraction of the simplices in have diameter larger than d/(d + 1), for . All the other simplices have diameter at most d/(d + 1). Of the latter simplices, after kd more stages, at most an -fraction of their descendants have diameter larger than . Hence, in , the fraction of simplices with diameter larger than is at most . And so on. In , the fraction of simplices with diameter larger than at most . Since for all d, the claim follows.

Now, given minimally so that , and then choose k minimally so that we have for some c.) Let be the region surrounding P that is at distance at most from P, and let be the union of all the simplices in with diameter larger than . By the choice of s, every point in belongs to . Let us bound each of and .

As (keeping P fixed) we have . Furthermore, P and S are both convex with , so . Since where is a regular simplex of unit side-length, we have and . Hence, by the choice of , we have . By Lemma 5, we also have . Hence, , and the first item follows.

For the second item, by construction . Now given a parameter , apply the first part of the theorem with , where is the volume of a d-dimensional ball of radius . (A calculation shows that , so it suffices to take for an appropriate constant .) Suppose for a contradiction that there exists a point that is outside of P. Then the ball is contained in . But since P is convex, more than half of B is outside of P. Hence, , contradicting the first part of the theorem. This implies that , concluding the second item and the proof of Theorem 2.

4 Learning algorithms

In Section 3, we demonstrated that the uniform subdivision embedding, coupled with an appropriate choice of weights, can represent an approximation to any given convex body. This motivates an embedding technique for a linear classifier.

For some parameter q (determined by cross validation), our classification algorithm produces a q-stage uniform subdivision: Beginning with a single simplex covering the entire space, at each stage we add to the system the barycentric center of each simplex, thereby splitting all simplices into d + 1 sub-simplices. We call a set of d + 1 simplices formed by a split siblings. The procedure stops after q stages, having produced simplices. We note that there is nothing to be gained by splitting an empty simplex, so the algorithm may ignore these; then an empty simplex must have a sibling that contains points, and since a simplex has d siblings, we have that the total number of simplices is not greater than . Parameter q is analogous to depth parameter s of Lemma 5; however, we have consistently observed by empirical cross-validation that it suffices to take q as a very small constant (at most 5), and so we stipulate in our algorithm that q be bound by a small universal constant.

Having computed the nested coordinate system, we use it to embed all points into high-dimensional space. To find an appropriate weight assignment w for the simplex points, we compute a linear classifier on the embedded space to separate the data. A linear classifier takes the form h(x) = signthis w serves as our weight vector for the embedding. We use soft SVM as our linear classifier, and note that the training phase can be executed in time O(dn) on (d + 1)-sparse vectors (Joachims, 2006). The total runtime of the algorithm is bounded by the cost of executing the sparse SVM plus the total number of simplex points, that is

To classify a new point, we can simply search top-bottom for its lowest containing simplex: We begin at the initial simplex, investigate which of its d sub-simplices contains the query point, and iterate on that simplex. This can all be done in time . After bounding the run time, we want to bound the out of sample error:

Theorem 3 If our classifier achieves sample error with margin (i.e., is the fraction of the points whose margin is less than ) on a sample of size n after stopping at stage q, its generalization error R is bounded by

with probability at least .

This bound is a consequence of the SVM margin bound (Mohri et al., 2012, Theorem 4.5) and the stratification technique (Shawe-Taylor et al., 1998), where the q-th stage receives weight .

Adaptive splitting strategies. The above algorithm is data-independent in its selection of split points. It is reasonable to suggest that a data-dependent choice of split points can improve the performance of the learning algorithm. Several greedy strategies suggest themselves, but after empirical trials we suggest the following split heuristic: At every stage, a linear classifier of the embedding space is computed. For each simplex, we identify the points in the simplex have been misclassified so far, and choose a data point which is closest to the the barycentric center of the misclassified points. As before, we limit the heuristic to a constant number of stages, and it is also not necessary to subdivide an empty simplex, or one that contains not many misclassified points. (See Section 5 for empirical results.) The following bounds follow from Corollary 2:

Theorem 4 If our adaptive classifier achieves sample error with margin (i.e., is the fraction of the points whose margin is less than ) on a sample of size n after stopping at stage q and retaining k split points, its generalization error R is bounded by

with probability at least .

5 Experiments

Our embedding technique is motivated by provable bounds for convex polytopes, but we find that it is sufficiently robust to yield impressive empirical results

Fig. 4: Learning a polytope separating the red and blue points.

for non-convex polytopes or even general point sets. All experiments utilized the python scikit-learn library (Pedregosa et al., 2011) . The regularization parameter C was 5-fold cross-validated over the set , and for the RBF kernel, the parameter was five-fold cross-validated over the set . For our methods, the maximum iteration parameter q was cross validated over the set {2, . . . , 5}. Our algorithms usually converged even before reaching the maximum number of allowed iterations.

Non-convex polytope approximation. Before presenting the experiments, we give a simple example that illustrates the power of our approach in approximating non-convex polytopes. We created a random data-set wherein all positive examples were taken from within a 5-gon and the negative points from outside it. This data was randomly generated within the unit circle: Each vector was sampled from the unit sphere and then normalized by , where random variable is sampled independently at random for each vector. We then sampled 5 halfspaces whose intersection formed the target polytope: For each halfspace, we sampled a random direction vector uniformly from the unit sphere, and then sampled a random offset value to produce the halfspace . The intersection of these halfspaces is the target polytope. All data points inside the polytope with margin 0.05 were labeled as positive, all data points outside the polytope with margin 0.05 were labeled as negative, and the rest were discarded.

Figure 4 shows the iterative boundary formation, where the bold black line is the decision boundary and the dotted lines are the margin (For each iteration, the nested barycentric system is illustrated by the red lines. A consistent approximation of the underlying polytope for multiple runs was achieved after only 3 iterations. Notice how the margins become smaller at each iteration until reaching their predetermined size.

Benchmarks. We first compared the runtime and accuracy of our methods in Section 4 – uniform subdivision with NBCS (uni-NBCS) and adaptive splitting with NBCS (adapt-NBCS) – to the 2and 3degree polynomial explicit feature maps, and to the RBF kernel SVM. We used CoreSVM for the RBF kernel, as the LibSVM RBF failed to run on very large datasets. We considered large datasets from LibSVM Machine Learning repository (Chang and Lin, 2011), taking random splits averaged over 10 random trials. In the LibSVM implementation, the runtime of 2degree SVM is and 3degree SVM is . We implemented our algorithm to run in Table 1 shows a summary of our experimental results, and demonstrates that our method compares favorably to the others both in runtime and accuracy. We believe that this is due to NBCS embedding the data into a small but yet very expressive space. We further compared our technique to other explicit feature map methods. Here we focused on accuracy as opposed to runtime, since all these methods have similar runtime complexity, Figure 5 demonstrates a comparison of the average accuracy between our embedding technique (adaptNBCS), Kitchen Sink (KS) (Rahimi and Recht, 2007), Nystrom’s approximation (Williams and Seeger, 2000) and the adaptive (Vedaldi and Zisserman, 2012), all of which have open source implementations, over a large variety of medium sized datasets. We also included the accuracy achieved by RBF CoreSVM. Again, our algorithm’s accuracy compared favorably with the others.

6 Discussion and future work

In this paper, we introduced the barycentric coordinate system embedding, demonstrated its computational power, and suggested implementation techniques. We derived a statistical foundation for this approach, and presented experiments on LibSVM datasets which show promising empirical results. This method is advantageous in the large data and small to medium feature size regime. Future work includes analytical and empirical investigations of other natural splitting strategies.

Table 1: UCI Datasets

Fig. 5: classification results for different embedding techniques

References

Anthony M, Bartlett PL (1999) Neural Network Learning: Theoret- ical Foundations. Cambridge University Press, Cambridge, DOI 10.1017/CBO9780511624216, URL http://dx.doi.org/10.1017/ CBO9780511624216

Arriaga RI, Vempala S (2006) An algorithmic theory of learning: Robust concepts and random projection. Machine Learning 63(2):161–182, URL https://doi.org/10.1007/s10994-006-6265-7

Beltrami M, da Silva ACL (2015) Grid-quadtree algorithm for support vector classification parameters selection. Appl Math Sci 9:75–82

Bernal A, Crammer K, Pereira F (2012) Automated gene-model curation using global discriminative learning. Bioinformatics

Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Chang Y, Hsieh C, Chang K, Ringgaard M, Lin C (2010) Training and testing low-degree polynomial data mappings via linear SVM. JMLR

Fine S, Scheinberg K (2001) Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research 2:243–264, URL http://www.jmlr.org/papers/v2/fine01a.html

Goel S, Klivans A (2018) Learning neural networks with two nonlinear layers in polynomial time (arxiv:1709.06010v4)

Gottlieb L, Kaufman E, Kontorovich A, Nivasch G (2018) Learning convex polytopes with margin. In: NeurIPS, pp 5711–5721

Goyal N, Rademacher L (2009) Learning convex bodies is hard, arxiv:0904.1227

Gu Q, Han J (2013) Clustered support vector machines. In: Artificial Intelligence and Statistics, pp 307–315

Hanneke S, Kontorovich A (2019) A sharp lower bound for agnostic learning with sample compression schemes. In: ALT

Hao Zhang, Berg AC, Maire M, Malik J (2006) Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol 2, pp 2126–2136

Joachims T (2006) Training linear SVMs in linear time. In: KDD

Khot S, Saket R (2011) On the hardness of learning intersections of two halfspaces. J Comput Syst Sci 77(1):129–141, URL https://doi.org/10. 1016/j.jcss.2010.06.010

Klivans AR, Servedio RA (2008) Learning intersections of halfspaces with a margin. J Comput Syst Sci 74(1):35–48, URL https://doi.org/10.1016/ j.jcss.2007.04.012

Klivans AR, Sherstov AA (2009) Cryptographic hardness for learning in- tersections of halfspaces. J Comput Syst Sci 75(1):2–12, URL https: //doi.org/10.1016/j.jcss.2008.07.008

Li F, Ionescu C, Sminchisescu C (2010) Random Fourier Approximations for Skewed Multiplicative Histogram Kernels, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 262–271. URL https://doi.org/10.1007/ 978-3-642-15986-2_27

Maji S, Berg A, J M (2012) Efficient classification for additive kernel SVMs. PAMI

Megiddo N (1988) On the complexity of polyhedral separability. Discrete & Computational Geometry 3(4):325–337, URL https://doi.org/10.1007/ BF02187916

Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations Of Machine Learning. The MIT Press

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830

Pele O, Taskar B, Globerson A, Werman M (2013) The pairwise piecewise-linear embedding for efficient non-linear classification. In: ICML

Perronnin F, Senchez J, et al. (2010) Large-scale image categorization with explicit data embedding. In: CVPR

Porikli F, Ozkan H (2011) Data driven frequency mapping for computationally scalable object detection. In: 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp 30–35

Rahimi A, Recht B (2007) Random features for large-scale kernel machines. NIPS

Rahimi A, Recht B (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in Neural Information Processing Systems 21, Curran Associates, Inc., pp 1313–1320

Saavedra E, Grauel A, Morton D (2004) Support vector machines and quad- trees applied to image compression. In: Proceedings of the 6th Nordic Signal Processing Symposium-NORSIG, Citeseer, vol 2004

Shawe-Taylor J, Bartlett PL, Williamson RC, Anthony M (1998) Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5):1926–1940

Smola AJ, Schökopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the Seventeenth International Conference on

Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’00, pp 911–918, URL http://dl.acm.org/citation.cfm?id= 645529.657980

Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research 6(Apr):363–392

Tsang IW, Kocsor A, Kwok JT (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on Machine learning, ACM, pp 911–918

Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. PAMI

Vempati S, Vedaldi A, Zisserman A, Jawahar CV (2010) Generalized RBF feature maps for efficient detection. In: BMVC, British Machine Vision Association, pp 1–11

Williams C, Seeger M (2000) The effect of the input density distribution on kernel-based classifiers. In: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, pp 1159–1166

Williams CKI, Seeger M (2001) Using the nyström method to speed up kernel machines. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in Neural Information Processing Systems 13, MIT Press, pp 682–688, URL http://papers.nips.cc/paper/ 1866-using-the-nystrom-method-to-speed-up-kernel-machines.pdf

A Hybrid PAC-compression bounds

In this section, we present a hybrid compression bound used in the derivation of Theorem 4.

It will be convenient to present our results in generality and then specialize.

Our notation will be in line with Hanneke and Kontorovich (2019). Let P be a distribution

on

We write and our main object of interest

will be

for . The catch is that F may itself be random, determined by the

distinguish from the more familiar object , which is formally defined as in

(14), but with the additional stipulation that F be a fixed function class, independent of

For a fixed , consider a fixed mapping . In words, k-tuples over Z into function classes over Z. This generalizes the notion of a decoding in a sample compression scheme, where -tuple over Denote by the collection of all functions constructable by on a given

whereis the set of all k-subsets of is the restriction of to the index set

The key observation is that, conditioned on , the function class deterministic and independent of

Conditional on , we have, for a given

where were used. It follows that

Theorem 5

To apply (16) to examples of interest, let us compute the right-hand side of the bound for some function classes.

In our first example, suppose that

concept classes — which might well be different for each k-tuple — of VC-dimension at most d.

More precisely, we take is an instance space. Let

be a concept class defined by the , with VC-dimension

to be its associated loss class:

We call this setting a hybrid (k, d) VC sample-compression scheme. It is well-known (see,

e.g., (Anthony and Bartlett, 1999, Theorem 4.9)) that

the extension to general ones is straightforward. The only requisite change consists of replacing

where c > 0 is a universal constant (for concreteness, we may take

is known to be concentrated about its mean (see, e.g., (Mohri et al., 2012, Theorem

3.1)):

Corollary 1 In a hybrid (k, d) VC sample compression scheme, on a sample of size n, a

learner’s sample error and generalization error

with probability at least

Here, we take X to be an abstract set,

and define

where is a map from determined by some

. Associate to -margin loss class

where . We refer to this setting as a

sample compression scheme. It is a standard fact (see, e.g., (Mohri et al., 2012, Theorem

4.4)) that

Combining (16), (19), and a standard stratification argument (see (Mohri et al., 2012, Theorem 4.5)), we obtain the following result. Fix a map . Given a sample drawn iid, the learner chooses some k examples to define the random mapping . Having mapped the sample to , he runs SVM and obtains a hyperplane w.

Corollary 2 With probability at least