Generalized Intersection Kernel

2016·arXiv

Abstract

1 Introduction

Following the idea from the recently proposed “generalized min-max (GMM)” kernel [7], we propose two types of nonlinear kernels which are basically tuning free and can handle data vectors with both negative and positive entries. The first step is a simple transformation on the original data. Consider, for example, the original data vector . We define the following transformation, depending on whether an entry is positive or negative:

For example, when 5 3], the transformed data vector becomes ˜u = [0 5 3 0].

Once we have only nonnegative data, we can define the GMM kernel as proposed in [7]

Inspired by the above idea, a variety of nonlinear kernels can be analogously defined, for example, the “generalized intersection (GInt)” kernel:

and the “normalized GMM (NGMM)” kernel:

Note that original (histogram) intersection kernel has been a popular tool in computer vision [11]. In this study, we will provide an extensive empirical evaluation of the GInt kernel and NGMM kernel on 40 classification datasets from the UCI repository. The empirical results indicate that the NGMM kernel typically outperforms the GInt kernel. In addition, there is an interesting connection between these two kernels, because we can re-write the NGMM kernel as

which means that the NGMM kernel is a monotonic transformation of the GInt kernel. This connection can be also be interpreted as that the NGMM kernel is an “asymmetrically transformed” version of the GInt kernel, based on the recent idea of asymmetric hashing [14].

Next, we first provide an empirical study on kernel SVMs based on the aforementioned kernels, followed by an empirical study on hashing (linearizing) the NGMM kernel.

2 An Experimental Study on Kernel SVMs

Table 1 lists 40 publicly available datasets, solely from the UCI repository, for our experimental study, along with the kernel SVM classification results for the RBF kernel, the GMM kernel, and the proposed NGMM kernel and GInt kernel, at the (individually) best -regularization C values. More detailed results (for all regularization C values) are available in Figure 1. To ensure repeatability, we use the LIBSVM pre-computed kernel functionality. Note that for the RBF kernel with a scale parameter , we report the best result among a wide range of the

The classification results in Table 1 and Figure 1 indicate that, on these datasets, the GInt kernel significantly outperforms the linear kernel, confirming the advantage of exploring data nonlinearity. The NGMM kernel typically outperforms the GInt kernel. It appears that the GInt kernel can be fairly safely replaced by the NGMM kernel, especially as the NGMM kernel can efficiently computed.

Table 1: 40 public (UCI) datasets and kernel SVM results. We report the test classification accuracies for the linear kernel, the best-tuned RBF kernel, the GMM kernel, the NGMM kernel, and the GInt kernel, at the best SVM regularization C values.

Figure 1: Test classification accuracies of various kernels using LIBSVM pre-computed kernel functionality. The results are presented with respect to C, which is the -regularized kernel SVM parameter. Note that except for the RBF kernels, all kernels are tuning-free. For RBF, at each C, we report the best test accuracies from a wide range of kernel parameter (

3 Kernel Linearization

The kernel classification experiments in Table 1 and Figure 1 have demonstrated the effectiveness of nonlinear kernels (GInt, NGMM, GMM, and RBF) in terms of prediction accuracies, compared to the linear kernel. However, it is well understood [2] that computing kernels are expensive and the kernel matrix, if fully materialized, does not fit in memory even for relatively small applications.

For example, for a small dataset with merely 60, 000 data points, the 60nel matrix has 3entries. In practice, being able to linearize nonlinear kernels becomes highly beneficial. Randomization (hashing) is a popular tool for kernel linearization. After data linearization, we can then apply our favorite linear learning packages such as LIBLINEAR [3] or SGD (stochastic gradient descent) [1].

There are multiple ways for kernel linearization. See [8] for the work on utilizing the Nystrom method for kernel approximation [13, 15] for GMM (and RBF). In this study, we will focus on the strategy based on GCWS hashing for linearizing the GMM and NGMM kernels.

3.1 GCWS: Generalized Consistent Weighted Sampling

After we have transformed the data according to (1), since the data are now nonnegative, we can apply the original “consistent weighted sampling” [12, 4, 5] to generate hashed data. [7] named this procedure “generalized consistent weighted sampling” (GCWS), as summarized in Algorithm 1.

With k samples, we can estimate NGMM(u, v) according to the following collision probability:

Note that is unbounded. Recently, [5] made an interesting observation that for practical data, it is ok to completely discard . The following approximation

is accurate in practical settings and makes the implementation convenient via the idea of b-bit minwise hashing [9].

For each vector u, we obtain k random samples . We store only the lowest b bits of . We need to view those k integers as locations (of the nonzeros) instead of numerical values. For example, when as a vector of length 2we code it as [1 0 0 0]; if such vectors into a binary vector of length 2, which contains exactly k 1’s. After we have generated such new data vectors for all data points, we feed them to a linear classifier. We can, of course, also use the new data for many other tasks including clustering, regression, and near neighbor search.

Note that for linear learning methods (especially online algorithms), the storage and computational cost is largely determined by the number of nonzeros in each data vector, i.e., the k in our case. It is thus crucial not to use a too large k.

3.2 An Experimental Study on “0-bit” GCWS

Figure 2 reports the test classification accuracies on 0-bit GCWS for hashing the NGMM kernel (solid curves) and the GMM kernel (dashed curves). In each panel, we also report the original linear kernel and NGMM kernel results using two solid and marked curves, with the upper curve for the NGMM kernel and bottom curve for the linear kernel. We report results for We also need to choose b, the number of bits for encoding each hashed value ; we reports experimental results for

The classification results confirm, just like the prior work [7], that the “0-bit” GCWS scheme performs well for hashing the NGMM kernel. Clearly, the accuracies are affected by the choice of b, but not too much especially when k is not too small. In general, we recommend using a larger b if the model size 2is affordable. The training cost is largely determined by k, not much by b. See [7] for the training time comparisons for different b values.

Often times, practitioners are particularly interested in choosing k large enough to exceed the accuracy of linear kernel. This is because in practice linear models are often used and a simple recipe which can be more accurate than linear models and does not increase much the computational cost would be highly desirable. We can see from Figure 2 that typically k does not have to very large in order to outperform the original linear model.

4 Conclusion

We propose the “generalized intersection (GInt)” kernel and the related “normalized generalized min-max (NGMM)” kernel. The original (histogram) intersection kernel has been popular in (e.g.,) computer vision. Interestingly, the NGMM kernel can be viewed as an “asymmetrically transformed” version of the GInt kernel from the perspective of recently proposed “asymmetric hashing” [14]. Our kernel SVM experiments on 40 UCI datasets illustrate that the NGMM kernel typically outperforms the GInt kernel. The recently proposed 0-bit GCWS scheme performs well for approximating the NGMM kernel, as expected. For readers who are interested in Nystrom method (another scheme for kernel linearization), please refer to an earlier technical report [8].

In this study, we focus on reporting results for classification. Obviously, the techniques can be used for many other tasks including regression, clustering, and near neighbor search. One notable advantage of the (0-bit) GCWS is that the hashed values are of discrete nature and can be directly used for building hash tables in the context of sublinear time near neighbor search. The Nystrom method dos not offer this benefit. Note that in the context of near neighbor search, the GCWS hashing provides a scheme for searching near neighbors in terms of not only the NGMM kernel distance but also the GInt kernel distance, because NGMM is a monotonic transformation of GInt.

While the results of this (short) report appear to support more the GMM kernel than the GInt kernel, we hope these two simple (tuning-free) kernels (NGMM and GInt) can provide practitioners more options to choose the appropriate method for their specific domain applications.

Figure 2: Test classification accuracies for using (0-bit) GCWS to hash the NGMM kernel (solid curves) and GMM (dashed curves), for . In each panel, the two solid and marked curves report the original results of the NGMM kernel (upper curve) and the linear kernel (bottom curve). 7

References

[1] L. Bottou. http://leon.bottou.org/projects/sgd.

[2] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.

[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[4] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.

[5] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.

[6] P. Li. Sign stable random projections for large-scale learning. Technical report, arXiv:1504.07235, 2015.

[7] P. Li. Generalized min-max kernel and generalized consistent weighted sampling. Technical report, arXiv:1605.05721, 2016.

[8] P. Li. Nystrom method for approximating the gmm kernel. Technical report, arXiv:1605.05721, 2016.

[9] P. Li and A. C. K¨onig. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.

[10] P. Li and C.-H. Zhang. Theory of the gmm kernel. Technical report, arXiv:1608.00550, 2016.

[11] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, pages 1–8, 2008.

[12] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

[13] E. J. Nystr¨om. ¨Uber die praktische aufl¨osung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Mathematica, 54(1):185–204, 1930.

[14] A. Shrivastava and P. Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In WWW, pages 981–991, 2015.

[15] C. K. I. Williams and M. Seeger. Using the nystr¨om method to speed up kernel machines. In NIPS, pages 682–688. 2001.

Designed for Accessibility and to further Open Science