Tunable GMM Kernels

2017·arXiv

Abstract

1 Introduction

The “generalized min-max (GMM)” kernel [9] was introduced for large-scale search and machine learning, owing to its efficient linearization via either hashing or the Nystrom method [10]. For defining the GMM kernel, the first step is a simple transformation on the original data. Consider, for example, the original data vector We define the following transformation, depending on whether an entry is positive or negative:

For example, when 4 6], the transformed data vector becomes ˜u = [0 4 6 0]. The GMM kernel is defined [9] as follows:

Even though the GMM kernel has no tuning parameter, it performs surprisingly well for clas-sification tasks as empirically demonstrated in [9] (also see Table 1 and Table 2), when compared to the best-tuned radial basis function (RBF) kernel:

where 0 is a crucial tuning parameter.

Furthermore, the (nonlinear) GMM kernel can be efficiently linearized via hashing [11, 3, 8] (or the Nystrom method [10]). This means we can use the linearized GMM kernel for large-scale machine learning tasks essentially at the cost of linear learning.

Naturally, one would ask whether we can improve this (tuning-free) GMM kernel by introducing tuning parameters. For example, we can define the following “exponentiated-GMM” (eGMM) kernel:

and the “powered-GMM” (pGMM) kernel:

Of course, we can also combine these two kernels:

In this study, we will provide an empirical study on kernel SVMs based on the above three tunable GMM kernels. Perhaps not surprisingly, the improvements can be substantial on some datasets. In particular, we will also compare them with deep nets and trees on 11 datasets [4]. In their previous studies, [5, 6, 7] developed tree methods including “abc-mart”, “robust logitboost”, and “abc-robust-logitboost” and demonstrated their excellent performance on those 11 datasets (and other datasets), by establishing the second-order tree-split formula and new derivatives for multi-class logistic loss. Compared to tree methods like “abc-robust-logitboost” (which are slow and need substantial model sizes), the proposed tunable GMM kernels produced largely comparable classification results.

2 An Experimental Study on Kernel SVMs

We essentially use similar datasets as in [9]. Table 1 lists a large number of publicly available datasets from the UCI repository and Table 2 presents datasets from the LIBSVM website and the 11 datasets for testing deep learning methods and trees [4, 7]. In both tables, we report the kernel SVM test classification results for the linear kernel, the best-tuned RBF kernel, the original (tuning-free) GMM kernel, the best-tuned eGMM kernel, and the pGMM kernel. For the epGMM kernel, the experimental results are reported in Section 3, e.g., Table 3.

In all the experiments, we adopt the -regularization (with a regularization parameter C) and report the test classification accuracies at the best C values in Table 1 and Table 2. More detailed results for a wide range of C values are reported in Figures 1, 2, and 3. To ensure repeatability, we use the LIBSVM pre-computed kernel functionality, at the significant cost of disk space. For the RBF kernel, we follow [9], by exhaustively experimenting with 58 different values of 0.01, 0.1:0.1:2, 2.5, 3:1:20 25:5:50, 60:10:100, 120, 150, 200, 300, 500, 1000}. Basically, Table 1 and Table 2 report the best RBF results among all values in our experiments. Here, 3:1:20 is the matlab notation, meaning that the iterations stat at 3 and terminate at 20, at a space of 1.

For the eGMM kernel, we experiment with the same set of (58) values as for the RBF kernel. For the pGMM kernel, however, because we have to materialize (store) a kernel matrix for each disk space becomes a serious concern. Therefore, for the pGMM kernel, we only search in the range of 30 : 10 : 100}. In other words, the performance of the pGMM kernel (and the epGMM kernel) would be further improved if we expand the range of search or granularity of spacing.

The classification results in Table 1 and 2 and Figures 1, 2 and 3 confirm that the eGMM and pGMM kernels typically improve the original GMM kernel. On a good fraction of datasets, the improvements can be very significant. In fact, Section 3 will show that using the epGMM kernel can bring in further improvements. Nevertheless, the RBF kernel still exhibits the best performance on a very small number of datasets. This is great because it means there is still room for improvement in future study.

Table 1: report the test classification accuracies for the linear kernel, the best-tuned RBF kernel, the original (tuning-free) GMM kernel, the best-tuned eGMM kernel, and the best-tuned pGMM kernel, at their individually-best SVM regularization C values.

Table 2: Datasets in group 1 are from the LIBSVM website. Datasets in group 2 were used by [4, 7] for testing deep learning algorithms and tree methods.

Figure 1: Test classification accuracies of various kernels using LIBSVM pre-computed kernel functionality. The results are presented with respect to C, which is the -regularized kernel SVM parameter. For RBF, eGMM, and pGMM, at each C, we report the best test accuracies from a wide range of kernel parameter (

Figure 2: Test classification accuracies of various kernels using LIBSVM pre-computed kernel functionality. The results are presented with respect to C, which is the -regularized kernel SVM parameter. For RBF, eGMM, and pGMM, at each C, we report the best test accuracies from a wide range of kernel parameter (

Figure 3: Test classification accuracies of various kernels using LIBSVM pre-computed kernel functionality. The results are presented with respect to C, which is the -regularized kernel SVM parameter. For RBF, eGMM, and pGMM, at each C, we report the best test accuracies from a wide range of kernel parameter (

3 The epGMM Kernel, Comparisons with Deep Nets and Trees

Given two data vectors u and v, the epGMM kernel is defined as

after applying the transformation in (1) to

In our experiments with the pGMM kernel, we searched for the best (i.e., the here) parameter in the range of 10 : 100}. Note that since we have to store a kernel matrix at each , the experiments are costly. For testing the epGMM kernel, we re-use the those pre-computed kernels and experiment with the epGMM kernel using the same values (which is the here) as for the RBF and eGMM kernels.

The experimental results are reported in Table 3 (the last column). We can see that the epGMM kernel indeed improves over the eGMM and pGMM kernels, as one would have expected. The improvements can be quite noticeable on those datasets.

Table 3: We add the results (test classification accuracies) for the epGMM kernel as the last column. We mainly focus on the datasets in group 1, for the purpose of comparing with deep nets and trees.

The 11 datasets in Group 1 of Table 3 were already used for testing deep learning algorithms and tree methods [4, 7]. It is perhaps surprising that the performance of the pGMM kernel (and the epGMM kernel) can be largely comparable to deep nets and boosted trees, as shown in Figure 4 and Table 4. These results are exciting, because, that this point, we merely use kernel SVM with single kernels. It is reasonable to expect that additional improvements might be achieved in future studies.

In their studies, [5, 6, 7] developed tree methods including “abc-mart”, “robust logitboost”, and “abc-robust-logitboost” and demonstrated their excellent performance on those 11 datasets (and other datasets), by establishing the second-order tree-split formula and new derivatives for multi-class logistic loss function. They always used a special histogram-based implementation named “adaptive binning”, and the “best-first” strategy for determining the region for the next split (thus, the trees were not balanced as they did not directly control the levels of depth.).

Figure 4 reports the test classification error rates (lower is better) for six datasets: M-Noise1, M-Noise2, ..., M-Noise6. In the left panel, we plot the results of the GMM kernel, the eGMM kernel, and the epGMM kernel, together with the results of two deep learning algorithms as reported in [4]. We can see that for most of those six datasets, the pGMM kernel and the epGMM kernel achieve the best accuracy. In the right panel of Figure 4, we compare epGMM with four boosted tree methods: mart, abc-mart, robust logitboost, and abc-robust-logitboost.

The “mart” tree algorithm [1] has been popular in industry practice, especially in search. At each boosting step, it uses the first derivative of the logistic loss function as the residual response to fit regression trees, to achieve excellent robustness and fairly good accuracy. The earlier work on “logitboost” [2] were believed to exhibit numerical issues (which in part motivated the development of mart). It turns out that the numerical issue does not actually exist after [7] derived the tree-split formula using both the first and second order derivatives of the logistic loss function. [7] showed the “robust logitboost” in general improves “mart”, as can be seen from Figure 4 (right panel).

Figure 4: Classification test error rates on M-Noise1, M-Noise2, ..., M-Noise6 datasets. The left panel compares GMM, pGMM, and epGMM with two deep learning algorithms as reported in [4]. The right panel compares epGMM with four boosted tree methods as reported in [7].

[5, 6, 7] made an interesting (and perhaps brave) observation that the derivatives (as in text books) of the classical logistic loss function can be written in a different form for the multi-class case, by enforcing the “sum-to-zero” constraints. At each boosting step, they identify a “base class” either by the “worst-class” criterion [5] or the exhaustive search method as reported in [6, 7]. This “adaptive base class (abc)” strategy can be combined with either mart or robust logitboost; hence the names “abc-mart” and “abc-robust-logitboost”. The improvements due to the use of “abc” strategy can also be substantial. Again, as mentioned earlier, in all the tree implementations, they [5, 6, 7] always used the adaptive-binning strategy for simplifying the implementation and speeding up training. Also, they followed the “best-first” criterion whereas many tree implementations used balanced trees (which may cause “data-imbalance” and reduce accuracy).

Table 4: Test error rates of five additional datasets reported in [4, 7]. The results in group 1 are from [4], where they compared kernel SVM, neural nets, and deep learning. The results in group 3 are from [7], which compared four boosted tree methods with deep nets.

Table 4 reports the test error rates on five other datasets: M-Basic, M-Rotate, M-Image, MRand, and M-RotImg. In group 1 (as reported in [4]), the results show that (i) the kernel SVM with RBF kernel outperforms the kernel SVM with polynomial kernel; (ii) deep learning algorithms usually beat kernel SVM and neural nets. Group 2 presents the same results as in Table 3 (in terms of error rates as opposed to accuracies). We can see that pGMM and epGMM outperform deep learning methods except for M-Rand. In group 3, overall the tree methods especially abc-robust-logitboost achieve very good accuracies. The results of pGMM and epGMM are largely comparable to the results of tree methods.

The training of boosted trees is typically slow (especially in high-dimensional data) because a large number of trees are usually needed in order to achieve good accuracies. Consequently, the model sizes of tree methods are usually large. Therefore, it would be exciting to have methods which are simpler than trees and achieve comparable accuracies.

4 Hashing the pGMM Kernel

It is now well-understood that it is highly beneficial to be able to linearize nonlinear kernels so that learning algorithms can be easily scaled to massive data. The prior work [9] has already demonstrated the effectiveness of the generalized consistent weighted sampling (GCWS) [11, 3, 8] for hashing the GMM kernel. In this study, we modify GCWS for linearizing the pGMM kernel as summarized in Algorithm 1.

With k samples, we can estimate pGMM(u, v) according to the following collision probability:

or, for implementation convenience, the approximate collision probability [8]:

For each vector u, we obtain k random samples . We store only the lowest b bits of . We need to view those k integers as locations (of the nonzeros). For example, when b = 2, we should view as a binary vector of length 2such vectors into a binary vector of length 2, which contains exactly k 1’s. We then feed the new data vectors to a linear classifier if the task is classification. The storage and computational cost is largely determined by the number of nonzeros in each data vector, i.e., the k in our case. This scheme can of course also be used for many other tasks including clustering, regression, and near neighbor search.

Note that the performance of pGMM can be heavily impacted by the tuning parameter the definition of the pGMM kernel. Figure 5 presents examples on M-Rotate and M-Image.

Figure 5: on the pGMM kernel SVM classification accuracies for two datasets. -

Figure 6 presents the experimental results on hashing for M-Rotate. For this dataset, is the best choice (among the range of values we have searched). Figure 6 plots the results for both 25 (left panels) and . Recall here b is the number of bits for representing each hashed value in the “0-bit CWS” scheme [8]. The results demonstrate that: (i) hashing using 25 produces better results than hashing using able to use a fairly large b value, for example, 4 or 8. Using smaller b values (e.g., b = 2) hurts the accuracy; (iii) With merely a small number of hashes (e.g., k = 128), the linearized pGMM kernel can significantly outperform the original linear kernel. Note that the original dimensionality is 784. This example illustrates the significant advantage of nonlinear kernel and hashing.

Figure 7 presents the experimental results on hashing for M-Noise1 dataset and M-Noise3 dataset, respectively on the left panels (the experimental results on hashing for M-Image dataset and M-RotImg dataset, respectively on the left panels (the results in Figure 6, confirming the significant advantages of the pGMM kernel and hashing.

Figure 9 (for CTG dataset) and Figure 10 (for SpamBase dataset) are somewhat different from the previous figures. For both datasets, using 05 achieves the best accuracy. We plot the results for 2, to visualize the trend.

5 Conclusion

It is commonly believed that deep learning algorithms and tree methods can produce the state-of-the-art results in many statistical machine learning tasks. In 2010, [7] reported a set of surprising experiments on the datasets used by the deep learning community [4], to show that tree methods can outperform deep nets on a majority (but not all) of those datasets and the improvements can be substantial on a good portion of datasets. [7] introduced several ideas including the second-order tree-split formula and the new derivatives for multi-class logistic loss function. Nevertheless, tree methods are slow and their model sizes are typically large.

In machine learning practice with massive data, it is desirable to develop algorithms which run almost as efficient as linear methods (such as linear logistic regression or linear SVM) and achieve similar accuracies as nonlinear methods. In this study, the tunable linearized GMM kernels are promising tools for achieving those goals. Our extensive experiments on the same datasets used for testing tree methods and deep nets demonstrate that tunable GMM kernels and their linearized versions through hashing can achieve comparable accuracies as trees. In general, the state-of-the-art boosted tree method called “abc-robust-logitboost” typically achieves better accuracies than the proposed tunable GMM kernels. Also, on some datasets, deep learning methods or RBF kernel SVM outperform tunable GMM kernels. Therefore, there is still room for future improvements.

In this study, we focus on testing tunable GMM kernels and their linearized versions using clas-sification tasks. It is clear that these techniques basically generate new data representations and hence can be applied to a wide variety of statistical learning tasks including clustering and regression. Due to the discrete name of the hashed values, the techniques naturally can also be used for building hash tables for fast near neighbor search.

The current version of this paper is mainly a technical note for supporting the recent work on “The Linearized GMM Kernels and Normalized Random Fourier Features” [9].

Figure 6: Test classification accuracies for using linear classifiers combined with hashing in Algorithm 1 on M-Rotate dataset, for 25 (left panels) and In each panel, the four solid curves correspond to results with k hashes for For comparisons, in each panel we also plot the results of linear classifiers on the original data (lower curve) and the results of pGMM kernel SVMs (higher curve), in two marked solid curves.

Figure 7: Test classification accuracies for using linear classifiers combined with hashing in Algorithm 1 on M-Noise1 dataset (left panels) and M-Noise3 dataset (right panels). In each panel, the four solid curves correspond to results with k hashes for . For comparisons, in each panel we also plot the results of linear classifiers on the original data (lower curve) and the results of pGMM kernel SVMs (higher curve), in two marked solid curves.

Figure 8: Test classification accuracies for using linear classifiers combined with hashing in Algorithm 1 on M-Image dataset (left panels) and M-RotImg dataset (right panels). In each panel, the four solid curves correspond to results with k hashes for . For comparisons, in each panel we also plot the results of linear classifiers on the original data (lower curve) and the results of pGMM kernel SVMs (higher curve), in two marked solid curves.

Figure 9: Test classification accuracies for using linear classifiers combined with hashing in Algorithm 1 on CTG dataset, for to visualize the trend that, for this dataset, the accuracy decreases with increasing . Three columns presents results for b = 8, 4, 2, respectively.

Figure 10: Test classification accuracies for using linear classifiers combined with hashing in Algorithm 1 on SpamBase dataset, for (from top to bottom) to visualize the trend that, for this dataset, the accuracy decreases with increasing . Three columns presents results for b = 8, 4, 2, respectively (from left to right).

References

[1] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.

[2] J. H. Friedman, T. J. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.

[3] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.

[4] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, Corvalis, Oregon, 2007.

[5] P. Li. Adaptive base class boost for multi-class classification. CoRR, abs/0811.1250, 2008.

[6] P. Li. Abc-boost: Adaptive base class boost for multi-class classification. In ICML, pages 625–632, Montreal, Canada, 2009.

[7] P. Li. Robust logitboost and adaptive base class (abc) logitboost. In UAI, 2010.

[8] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.

[9] P. Li. Linearized GMM kernels and normalized random fourier features. Technical report, arXiv:1605.05721, 2016.

[10] P. Li. Nystrom method for approximating the gmm kernel. Technical report, arXiv:1605.05721, 2016.

[11] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

Designed for Accessibility and to further Open Science