b

DiscoverSearch
About
My stuff
Nested Barycentric Coordinate System as an Explicit Feature Map
2020·arXiv
Abstract
Abstract

We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices’ barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.

image

Kernel methods provide two principal benefits: (1) They implicitly induce a non-linear feature map, which allows for a richer space of classifiers and (2) when the kernel trick is available, they effectively replace the dimension d of the feature space with the sample size n as the computational complexity parameter. As such, these are well-suited for the ‘high dimension, moderate data size’ regime. For very large datasets, however, naive use of kernel methods becomes prohibitive. The cost is incurred both at the training stage, where an optimal classifier is searched for over an n-dimensional space, and at the hypothesis evaluation stage, where a sum of n kernel evaluations must be computed.

For these reasons, for large data sets, explicit feature maps are sometimes preferred. Various approximations have been proposed to mitigate the computational challenges associated with explicit feature maps, including Chang et al. (2010); Maji et al. (2012); Perronnin et al. (2010); Rahimi and Recht (2007); Vedaldi and Zisserman (2012); Li et al. (2010); ?); ?); ?.

Our contribution. We propose a new embedding method which is well-suited for the large sample regime. Our technique consists of partitioning the space into a nested hierarchy of simplices, and then embedding each data point into features corresponding to the barycentric coordinates of the simplex that contains it. We then train a linear classifier in the rich feature space obtained from the simplices. For sample size n in d-dimensional space, our algorithm has runtime  O(d2n)regardless of the dimension of the embedding space (when the approximation parameter is taken to be fixed, see Sections 4 and 5). In contrast, standard kernelized SVM has a runtime  O(dn2).

Additionally, our embedding technique allows for highly non-linear decision boundaries, although these are linear within each simplex (and hence piecewise-linear overall), as explained in Section 2. At the same time, our approach is sufficiently robust to closely approximate realizable convex bodies – in fact, multiple such bodies – in only linear time in fixed dimension (Section 3). We also give generalization bounds based on empirical margin (Theorem 3) and a novel hybrid sample compression technique (Theorem 4). Finally, we perform an extensive empirical evaluation, in which our method consistently outperforms other explicit feature map classification methods, including a range of popular kernel embedding methods (Section 5).

1.1 Related Work

Kernel approximations for explicit feature maps come in two basic varieties: data-independent approximations to fixed kernels, and data-dependent feature maps.

Data-dependent kernel approximations. This category includes Nystrom’s approximation (Williams and Seeger, 2000), which projects the data onto a suitably selected subspace. If  K(x, zi)is the projection of example x onto the basis element  zi, the points  {z1, . . . , zn}are chosen to maximally capture the data variability. Some methods select  zifrom the sample. The selection can be random (Williams and Seeger, 2001), greedy (Smola and Schökopf, 2000), or involve an incomplete Cholesky decomposition (Fine and Scheinberg, 2001). Perronnin et al. (2010) applied Nystrom’s approximation to each dimension of the data independently, greatly increasing the efficiency of the method.

Data-independent kernel approximations. This category includes sampling the Fourier domain to compute explicit maps for translation invariant kernels. Rahimi and Recht (2007, 2009) do this for the radial basis function kernel, also known as Random Kitchen Sinks. Li et al. (2010); Vedaldi and Zisserman (2012) applied this technique to certain group-invariant kernels, and proposed an adaptive approximation to the  χ2kernel. Porikli and Ozkan (2011) map the input data onto a low-dimensional spectral (Fourier) feature space via a cosine transform. Vempati et al. (2010) proposed a skewed chi squared kernel, which allows for a simple Monte Carlo approximation of the feature map. Maji et al. (2012) approximated the intersection kernel and the  χ2kernel by a sparse closed-form feature map. Pele et al. (2013) suggested using not only piecewise linear function in each feature separately but also to add all pairs of features. Chang et al. (2010) conducted an extensive study on the usage of the second-order polynomial explicit feature map. Bernal et al. (2012) approximated second order features relationships via a Conditional Random Field model.

Decompositions and other SVM approaches. Simplex decompositions have been used to produce proximity-based classifiers (Belkin et al., 2018; Davies, 1996), but to the best of our knowledge, ours is the first work to utilize either nested simplex decompositions or barycentric centers in conjunction with SVM. Simplex decompositions are related to the quadtree, and the quadtree has been used together with SVM for various learning tasks (Saavedra et al., 2004; Beltrami and da Silva, 2015), but not for the creation of a kernel embeddings. Simplex decompositions are more efficient than quadtrees, since a simplex naturally decomposes into only d+1 sub-simplices (Section 2), while a quadtree cell naturally decomposes into  2dsub-cells.

As mentioned, our emphasis in this paper is specifically on explicit feature maps, but there are numerous approaches to reducing kernel SVM runtime (for example the CoreSVM of Tsang et al. (2005, 2007)). Another related paradigm is that of Local SVM (Hao Zhang et al., 2006; Gu and Han, 2013), which assumes continuity of the labels with respect to spacial proximity; similarly labeled points tend to cluster together. This differs from the underlying assumption motivating kernel SVM, which assumes that the data is approximately linearly separable, but not necessarily clusterable. These approaches find success in distinct settings, and are incomparable.

Approximating convex polytopes. Learning arbitrary convex bodies requires very large sample size (Goyal and Rademacher, 2009), and so we focus instead on convex polytopes defined by a small number of halfspaces. However, the problem of finding consistent polytopes is known to be NP-complete even when the polytope is simply the intersection of two hyperplanes (Megiddo, 1988). In fact, Khot and Saket (2011) showed that “unless NP = RP, it is hard to (even) weakly PAC-learn intersection of two halfspaces”, even when allowed the richer class of O(1) intersecting halfspaces. Klivans and Sherstov (2009) showed that learning an intersection of  nε halfspaces is intractable regardless of hypothesis representation (under certain cryptographic assumptions). These negative results have motivated researchers to consider the problem of discovering consistent polytopes which have some separating margin. Several approximation and learning algorithms have been suggested for this problem, featuring bounds with steep dependence on the inverse margin and number of halfspaces forming the polytope (Arriaga and Vempala, 2006; Klivans and Servedio, 2008; Gottlieb et al., 2018; Goel and Klivans, 2018).

In contrast, we show in Section 3 that our method is capable of approximating any convex polytope in linear time (in fixed dimension), independent of the halfspace number and with only logarithmic dependence on the inverse margin. It accomplishes this by finding a linear separator in the higher-dimensional embedded space, and projecting the solution back into the origin space. However, our approach is not strictly comparable to those above, as they are concerned with minimizing the disagreement between the computed polytope (or object) and the true underlying polytope with respect to the point space, while we minimize the volume of the space between them.

Here we describe the nested barycentric coordinate system embedding. We explain its construction and description, how to embed a point from the origin space into the new coordinate system, and how a point in the embedded system can be projected back into the origin space (Section 2.1). We then show that if we associate a weight with each simplex point, then the embedding and weights together imply some (not necessarily convex) polytope on the origin space (Section 2.2). Later in Section 3, we will show that this system is sufficiently robust that it can be used to approximate any convex body.

2.1 Nested barycentric embedding

Let  S ⊂ Rd be a regular simplex of unit side-length, and let  {q0, . . . , qd} be itsvertices. Each point x inside the simplex can be written using the barycentric

coefficients:

image

Here  αi(#„x)denotes the coefficient of point #„xcorresponding to vertex  qi. Let the ordered vector of  α’s,  {α0, . . . , αd}, corresponding to #„xbe denoted as φd+1(#„x). If we artificially augment the original feature space by adding another feature which equals to 1, i.e.  x = (x0, . . . , xd−1, 1) and qt = (q1t, . . . , qd−1,t, 1)and define the matrix  Qt := (q0, . . . , qt), then the transformation  φd+1(#„x) :Rd+1 → Rd+1is a linear transformation of the form  x = Qα.

We can further refine the system by introducing a new point  qdinside the simplex, thereby splitting the simplex into d + 1 new sub-simplices. We order the coordinates of our system as  {q0, . . . , qd}. A point #„xinside the system is embedded by first utilizing the d + 1 vertices of its surrounding simplex to compute the barycentric coefficients (the  α’s) of equation 1. Then #„xis assigned a vector wherein a coordinate corresponding to one of these d + 1 simplex vertices is set to the coefficient of that vertex, and all other coordinates are set equal to 0. This defines the embedding  φd+2(#„x) : Rd+1 → Rd+2.

The refinement process can be continued by choosing points inside simplices to further split these simplices. We define the nested architecture  Btand its associated embedding  φt(x)to be the coordinates  {q0, . . . , qt}constructed from  Bt−1by concatenating a new point  qt at step tto the previous coordinate system. Each point #„xis embedded using the barycentric coefficients of the vertices of the simplex surrounding point x and by assigning those coefficients in the index of the corresponding vertices and by assigning zero to all other vertices. We note that the embedding — the nested barycentric coordinate system — is sparse, as only d + 1 coefficients are non-zero, and also that the embedded points lie on the  L1sphere (� α = 1).

A point in the embedded space can be projected back into the origin space by utilizing the identity

image

2.2 Weights, hyperplanes and polytopes

Given an embedding, we will assign a set of weights #„wto the vertices  {q0, . . . , qt}.Then the set of points R such that:

image

is a union of regions whose boundaries are unions and intersections of hyperplanes. R can represent a polytope as well as the union of several disjoint

image

Fig. 1: Creation of the nested system about a convex polytope

polytopes, each of which is not necessarily convex (see Figure 2 for an illustration):

Lemma 1 Any hyperplane that crosses a single simplex can be defined by a set of weights #„w = {w0, . . . , wd}, such that all points #„xthat lie on the hyperplane satisfy the equation #„w · φd+1(#„x) = 0. Further, R is a union of regions whose boundaries are unions and intersections of hyperplanes, where each simplex contains at most one hyperplane.

Proof Choose d linearly independent points on the given hyperplane. Since these points are inside the coordinate system, they have unique barycentric coefficients and thus a unique representation. Finding these weights is equivalent to solving  A· #„w = #„0 , where Awhich is a matrix of dimension  d× (d+ 1) whoserows are the embeddings  φd+1(#„x)of the points. This is a homogeneous linear system and so the w’s are unique up to a scaling factor. Every point on the hyperplane is a linear combination of those d linearly independent points and thus also satisfies the equation �i wiαi = 0. Likewise, every set of weights represents at most one hyperplane crossing the system.

In Section 3, we will show that a simple nested barycentric system, together with a prudent choice of weights, can be used to closely approximate any given convex body. To this end, we will require a useful property of these systems — essentially, that splitting a simplex cannot decrease the expressiveness of the system. Recall that a barycentric system  Bkis defined by an ordered set of points; we will say that  Bk is contained in Bt (Bk ⊂ Bt) if Bkis a prefix of  Bt.

Theorem 1 Let  Bkbe a nested barycentric system, with  {q0, . . . , qk}as its coordinates, and let  Btbe a nested barycentric system such that  Bk ⊂ Bt. Let

image

Fig. 2: An example of an open polytope and two disjoint polytopes as a nested barycentric system

P be a polytope described in  Bkas:

image

then there exist a set of weights  w′such that P can also be described by  Bt. Further, w is a prefix of  w′.

In order to prove the theorem we must first demonstrate the relationship between coefficients before and after a simplex split.

Definition 1 Let  Btbe a nested barycentric system, with  {q0, . . . , qt}, Let the new splitting point be  qt+1. Since  qt+1 ∈ Bt, it can be written as:

image

We define  βito be the coefficients of the new coordinate of step t + 1 using the coordinate system at step t.

Lemma 2 For a given data point #„pthe connection between the coefficients of step t and t + 1 is:

image

For the simplicity of the notation in this proof we will use  αiinstead of αi(#„p ), and  α∗instead of  αt+1,t+1(#„p ).

Proof The data point #„pat step t can be written as:

image

The point #„pat step t + 1 is written as:

image

Combining equations 5 and 8, we derive:

image

We can now prove theorem 1 by induction:

Proof For a given polytope P with a set of weights  wtat system  Bt, such that  P = {x ∈ S : wt · φt(x) ≥ 0}, and  Bt ⊂ Bt+1, we choose the set of weights  wt+1for  Bt+1as follows: The first t weights of  Bt+1are the same as for  Bt (wi,t = wi,t+1 ∀i < t + 1), and  wt+1,t+1 = �i βiwi(t). Then the scaled distance of every given point represented in  Btfrom the hyperplane γt = wtφt(x), is the same as the scaled distance of the point represented in Bt+1 : γt+1 = wt+1φt+1(x), and specifically the polytope P remains the same. Using Lemma 2 we have:

image

In this section we show that the nested barycentric coordinate system (NBCS) can represent an arbitrarily close approximation to any convex body. As stated the NBCS produces a (not necessarily convex) piece-wise linear classifier. In fact, this method can approximate multiple convex bodies. For simplicity, we focus on the case of a single convex body, and demonstrate how our method approximates it. This will be done by placing split points at the barycenters of their containing simplices, where the barycenter of a simplex with vertices p0, . . . , pdis given by  (p0 + · · · + pd)/(d + 1).

In order to state our result formally, we introduce some notation: Given a point  p ∈ Rdand a parameter  ε > 0, let  Bε(p) = {q ∈ Rd : ∥q − p∥2 ≤ ε}be the ball of radius  εcentered at p. Given a set  X ⊆ Rd, let  X(−ε) = {p ∈ X :Bε(p) ⊆ X}be the set of all points of X that are at distance at least  εfrom the boundary of X. Recall that S denotes the unit simplex.

Theorem 2 Let P ⊆ Sbe a given convex body of diameter 1, and let  0 < ε < 1be given. Then there exists a nested system  Bt, obtained by always placing split points at the barycenters of their containing simplices, and a corresponding set of weights w, such that

image

satisfies the following:

image

Proof The construction proceeds in stages i = 0, 1, . . . , s. (Below, we will take s = 2O(d) ln2(1/ε).) At stage 0 the only points present are the vertices of S. At each stage  i, i ≥ 1, a new split point is placed at the barycenter of each existing simplex, and the final construction is called the s-stage uniform subdivision of S. Let Aibe the set of simplices present at stage i, and clearly  |Ai| = (d + 1)i.Note that all simplices in  Aihave the same volume.

The weights  wiare assigned as follows: Initially, vertices  q0, . . . , qd of S areassigned weights  w0 = · · · = wd = −1. At each stage  i ≥ 1, each new split point is given the smallest possible weight that ensures ˜P ⊇ P, where ˜Pis given by (11). Once a weight is assigned to a point, it is never changed again. In other words, for those points of  Bi+1that already belonged to  Bi, their weights at Bi+1are the same as their weights at  Bi.

Let  S′ ∈ Aibe a simplex with vertices  qi0, . . . , qidand weights  wi0, . . . , wid,respectively. Let  q′ = (qi0 + · · · + qid)/(d + 1)be the barycenter of  S′. By Theorem 1, if  q′ is assigned weight  wavg = (wi0 + · · · wid)/(d + 1), then ˜P ∩ S′remains unchanged. Hence, the weight  w′that will be assigned to  q′by our construction will satisfy  w′ ≤ wavg. And therefore, at each stage, ˜Ponly shrinks. If at stage i a certain simplex  S′ ∈ Ai satisfies S′ ∩ P = ∅, then at stage i + 1 the barycenter of  S′ will be assigned weight  −∞, so that the interior of  S′ willlie completely outside of ˜P.

Let us denote by ˜Psthe region ˜Pproduced by this construction after stage s. (See Figure 3 for an illustration in the plane.) We will now prove that, if s is made large enough, then ˜Psapproximates the given convex body P arbitrarily well, as stated in the theorem.

The diameter of a compact subset of  Rd is the maximum distance between two points in the set. In particular, the diameter of a simplex is the largest distance between two vertices of the simplex.

image

Fig. 3: Four stages of the approximation of a given convex polygon in the plane.

Lemma 3 Let  S′be a simplex with vertices  p0, . . . , pd, let c be the diameter of  S′, and let q be the barycenter of  S′. Then the distance between q and any vertex  piis at most cd/(d + 1).

Proof Fix  pi = 0for concreteness. Then, under the constraints  ∥pj∥2 ≤ cfor j ̸= i, the distance between q and  piis maximized by letting  pj = (c, 0, . . . 0)for all  j ̸= i, which yields the claimed distance.

Lemma 4 Let S′ be a simplex with diameter c. Let A be the collection of the (d + 1)d simplices obtained by a d-stage uniform subdivision of  S′. Then there are at least (d + 1)! simplices in A with diameter at most cd/(d + 1).

Proof By Lemma 3, every simplex in A that contains at most one vertex of  S′will have diameter at most cd/(d + 1). Each time a simplex  S′′is subdivided into d + 1 simplices by an interior point q, the new simplices share only d of their vertices with  S′′. Hence, at stage 1 of the subdivision of  S′, there are d + 1 simplices that share only d vertices with  S′; at stage 2, there are (d + 1)d simplices that share only  d − 1vertices with  S′; and so on, until at stage d there are  (d + 1)d · · · 2 = (d + 1)!simplices that share only one vertex with  S′.

Recall that  Aidenotes the collection of simplices present in the i-stage uniform subdivision of S.

Lemma 5 Let k, z be integers, and set s = zkd. Then at most a�z(1−e−d)k�-fraction of the simplices in  Ashave diameter larger than  (d/(d + 1))z.

Proof By repeated application of Lemma 4. After kd stages, at most an  α-fraction of the simplices in  Akdhave diameter larger than d/(d + 1), for  α =�1 − (d+1)!(d+1)d�k. All the other simplices have diameter at most d/(d + 1). Of the latter simplices, after kd more stages, at most an  α-fraction of their descendants have diameter larger than  (d/(d+1))2. Hence, in  A2kd, the fraction of simplices with diameter larger than  (d/(d + 1))2is at most  α + (1 − α)α < 2α. And so on. In  Azkd, the fraction of simplices with diameter larger than  (d/(d + 1))z isat most  zα. Since  (d + 1)!/(d + 1)d > e−dfor all d, the claim follows.

Now, given  ε, let ρ = ε/(2√2d2). Choose zminimally so that  (d/(d+1))z ≤ρ, and then choose k minimally so that  z(1−e−d)k ≤ ε/2. Let s = zkd. (Hence,we have  s ≤ cd ln2(1/ε)for some c.) Let  Z1be the region surrounding P that is at distance at most  ρfrom P, and let  Z2be the union of all the simplices in  Aswith diameter larger than  ρ. By the choice of s, every point in ˜Ps \ Pbelongs to  Z1 ∪ Z2. Let us bound each of  vol(Z1)and  vol(Z2).

As  ρ → 0(keeping P fixed) we have  vol(Z1) ≤ (1 + o(1))ρ surf(P). Furthermore, P and S are both convex with  P ⊆ S, so  surf(P) ≤ surf(S). Since S = Sdwhere  Sd ⊂ Rdis a regular simplex of unit side-length, we have vol(Sd) =√d + 1/(d!√2d)and  surf(Sd) = (d + 1) vol(Sd−1) ≈√2d2 vol(Sd). Hence, by the choice of  ρ, we have  vol(Z1) ≤ (ε/2) vol(S). By Lemma 5, we also have  vol(Z2) ≤ (ε/2) vol(S). Hence,  vol( ˜Ps \ P) ≤ ε vol(S), and the first item follows.

For the second item, by construction  P ⊆ ˜P. Now given a parameter δ > 0, apply the first part of the theorem with  ε = vol(Bδ)/(2 vol(S)), where vol(Bδ) ≈ δdπd/2/(d/2)!is the volume of a d-dimensional ball of radius  δ. (A calculation shows that  ε ≥ δd, so it suffices to take  s = (c′)d ln2(1/δ)for an appropriate constant  c′.) Suppose for a contradiction that there exists a point  p ∈ ˜P (−δ)sthat is outside of P. Then the ball  B = Bδ(p)is contained in ˜P. But since P is convex, more than half of B is outside of P. Hence, vol( ˜Ps \ P) > vol(B)/2 = ε vol(S), contradicting the first part of the theorem. This implies that ˜P (−δ)s ⊆ P, concluding the second item and the proof of Theorem 2.

In Section 3, we demonstrated that the uniform subdivision embedding, coupled with an appropriate choice of weights, can represent an approximation to any given convex body. This motivates an embedding technique for a linear classifier.

For some parameter q (determined by cross validation), our classification algorithm produces a q-stage uniform subdivision: Beginning with a single simplex covering the entire space, at each stage we add to the system the barycentric center of each simplex, thereby splitting all simplices into d + 1 sub-simplices. We call a set of d + 1 simplices formed by a split siblings. The procedure stops after q stages, having produced  (d+1)q simplices. We note that there is nothing to be gained by splitting an empty simplex, so the algorithm may ignore these; then an empty simplex must have a sibling that contains points, and since a simplex has d siblings, we have that the total number of simplices is not greater than  min{(d + 1)q, dnq}. Parameter q is analogous to depth parameter s of Lemma 5; however, we have consistently observed by empirical cross-validation that it suffices to take q as a very small constant (at most 5), and so we stipulate in our algorithm that q be bound by a small universal constant.

Having computed the nested coordinate system, we use it to embed all points into high-dimensional space. To find an appropriate weight assignment w for the simplex points, we compute a linear classifier on the embedded space to separate the data. A linear classifier takes the form h(x) = sign(w · x), andthis w serves as our weight vector for the embedding. We use soft SVM as our linear classifier, and note that the training phase can be executed in time O(dn) on (d + 1)-sparse vectors (Joachims, 2006). The total runtime of the algorithm is bounded by the cost of executing the sparse SVM plus the total number of simplex points, that is  O(min{d(d+1)q+dn, d2qn} = min{dO(1)+dn, O(d2n)}).

To classify a new point, we can simply search top-bottom for its lowest containing simplex: We begin at the initial simplex, investigate which of its d sub-simplices contains the query point, and iterate on that simplex. This can all be done in time  O(qd2) = O(d2). After bounding the run time, we want to bound the out of sample error:

Theorem 3 If our classifier achieves sample error ˆRwith margin  γ(i.e., ˆRis the fraction of the points whose margin is less than  γ) on a sample of size n after stopping at stage q, its generalization error R is bounded by

image

with probability at least  1 − δ.

This bound is a consequence of the SVM margin bound (Mohri et al., 2012, Theorem 4.5) and the stratification technique (Shawe-Taylor et al., 1998), where the q-th stage receives weight  1/2q.

Adaptive splitting strategies. The above algorithm is data-independent in its selection of split points. It is reasonable to suggest that a data-dependent choice of split points can improve the performance of the learning algorithm. Several greedy strategies suggest themselves, but after empirical trials we suggest the following split heuristic: At every stage, a linear classifier of the embedding space is computed. For each simplex, we identify the points in the simplex have been misclassified so far, and choose a data point which is closest to the the barycentric center of the misclassified points. As before, we limit the heuristic to a constant number of stages, and it is also not necessary to subdivide an empty simplex, or one that contains not many misclassified points. (See Section 5 for empirical results.) The following bounds follow from Corollary 2:

Theorem 4 If our adaptive classifier achieves sample error ˆRwith margin  γ(i.e., ˆRis the fraction of the points whose margin is less than  γ) on a sample of size n after stopping at stage q and retaining k split points, its generalization error R is bounded by

image

with probability at least  1 − δ.

Our embedding technique is motivated by provable bounds for convex polytopes, but we find that it is sufficiently robust to yield impressive empirical results

image

Fig. 4: Learning a polytope separating the red and blue points.

for non-convex polytopes or even general point sets. All experiments utilized the python scikit-learn library (Pedregosa et al., 2011) 1. The regularization parameter C was 5-fold cross-validated over the set  {2−5, 2−3, . . . , 215}, and for the RBF kernel, the  γparameter was five-fold cross-validated over the set {2−15, 2−3, . . . , 23}. For our methods, the maximum iteration parameter q was cross validated over the set {2, . . . , 5}. Our algorithms usually converged even before reaching the maximum number of allowed iterations.

Non-convex polytope approximation. Before presenting the experiments, we give a simple example that illustrates the power of our approach in approximating non-convex polytopes. We created a random data-set wherein all positive examples were taken from within a 5-gon and the negative points from outside it. This data was randomly generated within the unit circle: Each vector was sampled from the unit sphere and then normalized by  u1/d, where random variable  u ∈ [0, 1]is sampled independently at random for each vector. We then sampled 5 halfspaces whose intersection formed the target polytope: For each halfspace, we sampled a random direction vector  wjuniformly from the unit sphere, and then sampled a random offset value  bj ∈ [.05, .95]to produce the halfspace  (wj, bj). The intersection of these halfspaces is the target polytope. All data points inside the polytope with margin 0.05 were labeled as positive, all data points outside the polytope with margin 0.05 were labeled as negative, and the rest were discarded.

Figure 4 shows the iterative boundary formation, where the bold black line is the decision boundary and the dotted lines are the margin (w · φt(x) = ±1) .For each iteration, the nested barycentric system is illustrated by the red lines. A consistent approximation of the underlying polytope for multiple runs was achieved after only 3 iterations. Notice how the margins become smaller at each iteration until reaching their predetermined size.

Benchmarks. We first compared the runtime and accuracy of our methods in Section 4 – uniform subdivision with NBCS (uni-NBCS) and adaptive splitting with NBCS (adapt-NBCS) – to the 2ndand 3rddegree polynomial explicit feature maps, and to the RBF kernel SVM. We used CoreSVM for the RBF kernel, as the LibSVM RBF failed to run on very large datasets. We considered large datasets from LibSVM Machine Learning repository (Chang and Lin, 2011), taking random  70 − 30%splits averaged over 10 random trials. In the LibSVM implementation, the runtime of 2nddegree SVM is  O(d2n)and 3rddegree SVM is  O(d3n). We implemented our algorithm to run in  O(d2n) time.Table 1 shows a summary of our experimental results, and demonstrates that our method compares favorably to the others both in runtime and accuracy. We believe that this is due to NBCS embedding the data into a small but yet very expressive space. We further compared our technique to other explicit feature map methods. Here we focused on accuracy as opposed to runtime, since all these methods have similar runtime complexity, Figure 5 demonstrates a comparison of the average accuracy between our embedding technique (adaptNBCS), Kitchen Sink (KS) (Rahimi and Recht, 2007), Nystrom’s approximation (Williams and Seeger, 2000) and the adaptive  χ2 (Vedaldi and Zisserman, 2012), all of which have open source implementations, over a large variety of medium sized datasets. We also included the accuracy achieved by RBF CoreSVM. Again, our algorithm’s accuracy compared favorably with the others.

In this paper, we introduced the barycentric coordinate system embedding, demonstrated its computational power, and suggested implementation techniques. We derived a statistical foundation for this approach, and presented experiments on LibSVM datasets which show promising empirical results. This method is advantageous in the large data and small to medium feature size regime. Future work includes analytical and empirical investigations of other natural splitting strategies.

image

Table 1: UCI Datasets

image

Fig. 5: classification results for different embedding techniques

Anthony M, Bartlett PL (1999) Neural Network Learning: Theoret- ical Foundations. Cambridge University Press, Cambridge, DOI 10.1017/CBO9780511624216, URL http://dx.doi.org/10.1017/ CBO9780511624216

Arriaga RI, Vempala S (2006) An algorithmic theory of learning: Robust concepts and random projection. Machine Learning 63(2):161–182, URL https://doi.org/10.1007/s10994-006-6265-7

image

Beltrami M, da Silva ACL (2015) Grid-quadtree algorithm for support vector classification parameters selection. Appl Math Sci 9:75–82

Bernal A, Crammer K, Pereira F (2012) Automated gene-model curation using global discriminative learning. Bioinformatics

Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Chang Y, Hsieh C, Chang K, Ringgaard M, Lin C (2010) Training and testing low-degree polynomial data mappings via linear SVM. JMLR

image

Fine S, Scheinberg K (2001) Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research 2:243–264, URL http://www.jmlr.org/papers/v2/fine01a.html

Goel S, Klivans A (2018) Learning neural networks with two nonlinear layers in polynomial time (arxiv:1709.06010v4)

Gottlieb L, Kaufman E, Kontorovich A, Nivasch G (2018) Learning convex polytopes with margin. In: NeurIPS, pp 5711–5721

Goyal N, Rademacher L (2009) Learning convex bodies is hard, arxiv:0904.1227

Gu Q, Han J (2013) Clustered support vector machines. In: Artificial Intelligence and Statistics, pp 307–315

Hanneke S, Kontorovich A (2019) A sharp lower bound for agnostic learning with sample compression schemes. In: ALT

Hao Zhang, Berg AC, Maire M, Malik J (2006) Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol 2, pp 2126–2136

Joachims T (2006) Training linear SVMs in linear time. In: KDD

Khot S, Saket R (2011) On the hardness of learning intersections of two halfspaces. J Comput Syst Sci 77(1):129–141, URL https://doi.org/10. 1016/j.jcss.2010.06.010

Klivans AR, Servedio RA (2008) Learning intersections of halfspaces with a margin. J Comput Syst Sci 74(1):35–48, URL https://doi.org/10.1016/ j.jcss.2007.04.012

Klivans AR, Sherstov AA (2009) Cryptographic hardness for learning in- tersections of halfspaces. J Comput Syst Sci 75(1):2–12, URL https: //doi.org/10.1016/j.jcss.2008.07.008

Li F, Ionescu C, Sminchisescu C (2010) Random Fourier Approximations for Skewed Multiplicative Histogram Kernels, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 262–271. URL https://doi.org/10.1007/ 978-3-642-15986-2_27

Maji S, Berg A, J M (2012) Efficient classification for additive kernel SVMs. PAMI

Megiddo N (1988) On the complexity of polyhedral separability. Discrete & Computational Geometry 3(4):325–337, URL https://doi.org/10.1007/ BF02187916

Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations Of Machine Learning. The MIT Press

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830

Pele O, Taskar B, Globerson A, Werman M (2013) The pairwise piecewise-linear embedding for efficient non-linear classification. In: ICML

Perronnin F, Senchez J, et al. (2010) Large-scale image categorization with explicit data embedding. In: CVPR

Porikli F, Ozkan H (2011) Data driven frequency mapping for computationally scalable object detection. In: 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp 30–35

Rahimi A, Recht B (2007) Random features for large-scale kernel machines. NIPS

Rahimi A, Recht B (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in Neural Information Processing Systems 21, Curran Associates, Inc., pp 1313–1320

Saavedra E, Grauel A, Morton D (2004) Support vector machines and quad- trees applied to image compression. In: Proceedings of the 6th Nordic Signal Processing Symposium-NORSIG, Citeseer, vol 2004

Shawe-Taylor J, Bartlett PL, Williamson RC, Anthony M (1998) Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5):1926–1940

Smola AJ, Schökopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the Seventeenth International Conference on

Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’00, pp 911–918, URL http://dl.acm.org/citation.cfm?id= 645529.657980

Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research 6(Apr):363–392

Tsang IW, Kocsor A, Kwok JT (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on Machine learning, ACM, pp 911–918

Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. PAMI

Vempati S, Vedaldi A, Zisserman A, Jawahar CV (2010) Generalized RBF feature maps for efficient detection. In: BMVC, British Machine Vision Association, pp 1–11

Williams C, Seeger M (2000) The effect of the input density distribution on kernel-based classifiers. In: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, pp 1159–1166

Williams CKI, Seeger M (2001) Using the nyström method to speed up kernel machines. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in Neural Information Processing Systems 13, MIT Press, pp 682–688, URL http://papers.nips.cc/paper/ 1866-using-the-nystrom-method-to-speed-up-kernel-machines.pdf

A Hybrid PAC-compression bounds

In this section, we present a hybrid compression bound used in the derivation of Theorem 4.

General theory.It will be convenient to present our results in generality and then specialize.

Our notation will be in line with Hanneke and Kontorovich (2019). Let P be a distribution

on  Z. We write Z[n] = (Z1, . . . , Zn) ∼ P n and, for f ∈ [0, 1]Z,

image

We write  ∆n(f) = ∆n(f, P, Z[n]) := R(f, P) − ˆR(f, Z[n])and our main object of interest

will be

image

for  F ⊂ [0, 1]Z. The catch is that F may itself be random, determined by the  Z[n]. We will

distinguish ¯∆n(F)from the more familiar object ¯∆FIXn (F), which is formally defined as in

(14), but with the additional stipulation that F be a fixed function class, independent of

Z[n].

For a fixed  k ∈ N, consider a fixed mapping  ρ : Zk �→ Fk ⊂ [0, 1]Z. In words,  ρ mapsk-tuples over Z into function classes over Z. This generalizes the notion of a decoding in a sample compression scheme, where  ρ maps a k-tuple over  Z into a single function f ∈ [0, 1]Z.Denote by  Fρ(Z[n])the collection of all functions constructable by  ρon a given  Z[n]:

image

where�[n]k�is the set of all k-subsets of  [n], and ZIis the restriction of  Z[n]to the index set

I. 2

image

The key observation is that, conditioned on  ZI, the function class  FI := ρ(ZI) becomesdeterministic and independent of  ZJ, where J := [n] \ I. Thus,

image

Conditional on  ZI, we have, for a given  f ∈ FI,

image

where  f(·) ∈ [0, 1] and |J| = n − kwere used. It follows that

image

Theorem 5

image

To apply (16) to examples of interest, let us compute the right-hand side of the bound for some function classes.

Example: VC classes.In our first example, suppose that  ρ maps k-tuples of Z to binary

concept classes — which might well be different for each k-tuple — of VC-dimension at most d.

More precisely, we take  Z = X ×{0, 1}, where Xis an instance space. Let  H = Hz ⊆ {0, 1}X

be a concept class defined by the  k-tuple z ∈ Zk, with VC-dimension  d. Define F ⊆ {0, 1}Z

to be its associated loss class:

image

We call this setting a hybrid (k, d) VC sample-compression scheme. It is well-known (see,

e.g., (Anthony and Bartlett, 1999, Theorem 4.9)) that

image

the extension to general ones is straightforward. The only requisite change consists of replacing

∪I∈�[n]k� with ∪I∈[n]k in (15).

where c > 0 is a universal constant (for concreteness, we may take  c = 144)3. Further,

¯

∆FIXn−k(F)is known to be concentrated about its mean (see, e.g., (Mohri et al., 2012, Theorem

3.1)):

image

Corollary 1 In a hybrid (k, d) VC sample compression scheme, on a sample of size n, a

learner’s sample error  �err(ˆhn)and generalization error  err(ˆhn) satisfy

image

with probability at least  1 − δ.

Example: Margin classes.Here, we take X to be an abstract set,  Y = {−1, 1}, Z = X ×Y,

and define

image

where  Ψ(x) = Ψz(x)is a map from  X to RN determined by some  k-tuple z ∈ Z, with

∥Ψz(·)∥ ≤ 1. Associate to ˜H the γ-margin loss class

image

where  Φγ(t) = min(0, max(1, 1 − t/γ)). We refer to this setting as a  hybrid (k, γ) margin

sample compression scheme. It is a standard fact (see, e.g., (Mohri et al., 2012, Theorem

4.4)) that

image

Combining (16), (19), and a standard stratification argument (see (Mohri et al., 2012, Theorem 4.5)), we obtain the following result. Fix a map  ρ : Zk → Ψ(·). Given a sample Z[n] = (Xi, Yi)i∈[n]drawn iid, the learner chooses some k examples to define the random mapping  Ψz : X → RN. Having mapped the sample to  RN, he runs SVM and obtains a hyperplane w.

Corollary 2 With probability at least  1 − δ, we have

image


Designed for Accessibility and to further Open Science