b

DiscoverSearch
About
My stuff
Theory of the GMM Kernel
2016·arXiv
Abstract
Abstract

We1 develop some theoretical results for a robust similarity measure named “generalized min-max” (GMM). This similarity has direct applications in machine learning as a positive definite kernel and can be efficiently computed via probabilistic hashing. Owing to the discrete nature, the hashed values can also be used for efficient near neighbor search. We prove the theoretical limit of GMM and the consistency result, assuming that the data follow an elliptical distribution, which is a very general family of distributions and includes the multivariate t-distribution as a special case. The consistency result holds as long as the data have bounded first moment (an assumption which essentially holds for datasets commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. Compared to the “cosine” similarity which is routinely adopted in current practice in statistics and machine learning, the consistency of GMM requires much weaker conditions. Interestingly, when the data follow the t-distribution with  νdegrees of freedom, GMM typically provides a better measure of similarity than “cosine” roughly when  ν < 8(which is already very close to normal). These theoretical results will help explain the recent success of GMM [11, 12] in learning tasks.

In statistics and machine learning, it is often crucial to choose, either explicitly or implicitly, some measure of data similarity. The most commonly adopted measure might be the “cosine” similarity:

image

where x and y are n-dimensional data vectors. This measure implicitly assumes that the data have bounded second moment otherwise it will not converge to a fixed limit as the sample size increases. The data encountered in the real-world, however, are virtually always heavy-tailed [9, 4, 5]. [14] argued that the many natural datasets follow the power law with exponent (denote by  ν) varyingbetween 1 and 2. For example,  ν = 1.2for the frequency of use of words,  ν = 2.04for the number of citations to papers,  ν = 1.4for the number of hits on the web sites, etc. Basically,  ν > 2 meansthat data have bounded second moment. The cosine similarity (1) will not converge (as  n → ∞) toa fixed constant if the data do not have bounded second moment.

In this study, we analyze the “generalized min-max” (GMM) similarity. First, we define

image

Then we compute GMM as follows:

image

Note that for nonnative data, GMM becomes the original “min-max” kernel, which has been studied in the literature [8, 3, 13, 7, 10]. This paper focuses on analyzing theoretical properties of GMM. In particular, we are interested in the limit of  gn(x, y) as n → ∞and how fast  gnconverges to the limit. The convergency and speed of convergence are important. For example, the cosine similarity (1) is popular largely because, as long as the data have bounded second moments, Cos(x, y) converges to a fixed limit which is believed to be a good characterization of the similarity between x and y.

To proceed with the analysis, we will have to make assumptions on the data. In this paper, we adopt the “elliptical distribution” model [1] which is very broad and includes many common distributions (such as Gaussian and Cauchy) as special cases. We first provide a simulation study.

The bivariate t-distribution has an explicit density and is a special case of the elliptical distribution. Denote by  tΣ,νthe bivariate t-distribution with covariance matrix  Σ and νdegrees of freedom. Basically, if two independent variables  Z ∼ N(0, Σ) and u ∼ χ2ν, then we have  Z�ν/u ∼ tΣ,ν.

image

compute  GMM(x, y) = gn(x, y)according to (2). We are interested in the mean an standard deviation of GMM for  n ∈ {1, 10, 100, 1000, 10000} and ν ∈ {3, 2, 1, 0.5}, as shown in Figure 1.

The panels in the first (top) row present the mean of GMM (gn).The curves of GMM lie between two fixed curves,  f1 and f∞, which we will calculate to be the following expressions:

image

For better clarity, the panels in the second (middle) row plot the magnified portion. In each panel, the top (dashed and green if color is available) curve represent  f1and the bottom (dashed and red) curve represent  f∞. We can see that for  ν = 3 and ν = 2, gnconverges to  f∞ fast. For ν = 1, gnalso converges to  f∞but much slower. With  ν = 0.5, gndoes not converge to  f∞.

The panels in the third (bottom) row plot the standard deviation (std). For  ν ≥ 1, the stdcurves converge to 0, although at  ν = 1the convergence is much slower. When  ν = 0.5, thestandard deviation does not converge to 0.

Basically, the simulations suggest that  gnconverges to  f∞as long as the data have bounded first moment (i.e.,  ν > 1) and the convergence still holds for the boundary case (i.e.,  ν = 1). Wewill provide thorough theoretical analysis on  gnfor the general elliptical distribution.

Because  ρmeasures data similarity, the fact that  gn → f∞as long as  ν ≥ 1is important because it means we have a robust measure of  ρas long as the data are “reasonably” distributed. As shown by [14], most natural datasets have the equivalent  ν > 1.

image

Figure 1: We simulate  GMM = gndefined in (2) from the bivariate t-distribution with  ν =0.5, 1, 2, 3 degrees of freedom, and n = 1, 10, 100, 1000, for 10000 repetitions. In the panels of the first two rows, we plot the mean curves together with two fixed (dashed) curves  f1 and f∞ definedin (3). The panels in the second row are the zoomed-in version of the panels in the first row. The bottom panels plot the empirical standard deviation of  gn.

We consider  (xi, yi), i = 1 to n, are iid copy of (X, Y ). Our goal is to analyze the statistical behavior of GMM, especially as  n → ∞,

image

To proceed with the theoretical analysis, we make a very general distributional assumption on the data. We say the vector (X, Y ) has an elliptical distribution if

image

where  A = (a1, a2)T is a deterministic  2 × 2 matrix, Uis a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. See [1] for an introduction. In the family of elliptical distributions, there are two important special cases:

image

image

Note that in  Σwe consider  σ ̸= 1to allow the situation that two vectors have different scales. For the convenience of presenting our theoretical results, we summarize the notations:

image

 τ ∈ [−π/2 + 2α, π/2]is the solution of cos(τ − 2α)/ cos τ = σ, i.e., τ = arctan(σ/ sin(2α) −cot(2α)). Note that  τ = α if σ = 1.

In addition, we need the following definitions of  f1(ρ, σ) and f∞(ρ, σ), for general  σas well as  σ = 1:

image

Theorem 1 presents the results for consistency.

Theorem 1. (Consistency) Assume (X, Y ) has an elliptical distribution with  (X, Y )T = AUTand  Σ = AAT =� 1 σρσρ σ2�. Let (xi, yi), i = 1 to n, be iid copies of (X, Y ), and GMM(x, y) =

image

If (X, Y ) has a t-distribution with  νdegrees of freedom, then  gn → f∞(ρ, σ)almost surely if ν > 1 and gn → f∞(ρ, σ)in probability if  ν = 1.

Theorem 2 presents the results for asymptotic normality.

Theorem 2. (Asymptotic Normality) With the same notation and definitions as in Theorem 1, the following statements hold:

image

= 14π3�2τ + π − 4α + sin(2τ − 4α) + σ2 (π − 2τ − sin(2τ))�{σ(1 + sin τ) + 1 + sin(2α − τ)}2+ 14π3�σ2 (2τ + sin(2τ) + π) + (π + 4α − 2τ − sin(2τ − 4α)) + 4σ (sin 2α − 2α cos 2α)�

image

Figure 2 presents a simulation study to verify the asymptotic normality, in particular, the asymptotic variance formula

image

by considering that the data follow a t-distribution with  νdegrees of freedom and  ν = 2.5, 3, 4, 5.The simulation results confirm the asymptotic variance formula at large enough sample size n. When n is not too large, the asymptotic variance formula (18) can be conservative.

image

Figure 2: Simulations for verifying the asymptotic variance formula (18) based on t-distribution with  νdegrees of freedom where  ν ∈ {2.5, 3, 4, 5} and ρ ∈ [−1, 1]spaced at 0.01. For each case, we repeat the simulation 10000 times. We report the empirical  V ar(gn) × nwith the theoretical asymptotic value VH4 ET 2E2T plotted as dashed curves. For n large enough, the asymptotic variance formula (18) becomes accurate. For small n values, however, the formula can be quite conservative.

The fact that  gn(x, y) → f∞(ρ, σ)also provides a robust and convenient way to estimate the similarity between data vectors. Here, for convenience we consider  σ = 1. For this case, we have f∞ =1−√(1−ρ)/21+√(1−ρ)/2. This suggests an estimator of  ρ:

image

As  n → ∞, gn → f∞ and ˆρg → ρ. In other words, the estimator  ˆρgis asymptotically unbiased. The asymptotic variance of  ˆρgcan be computed using “delta method”:

image

See (18) and Theorem 2 for more details. Again, we emphasize that this estimator  ˆρgis meaningful as long as  ET < ∞ and V ar (ˆρg) < ∞as long as  ET 2 < ∞.

It is interesting to compare this estimator with the commonly used estimator based on the “cosine” similarity:

image

When the data are bivariate normal, it is a known result [1] that  cn(x, y), when appropriately normalized, converges in distribution to a normal

image

This asymptotic normality (with difference in the variance term) holds as long as the data have bounded fourth moment. Here, we present the generalization as a theorem.

Theorem 3.  If ET 4 < ∞, then

image

Based on Theorem 3, a natural estimator of  ρand its asymptotic variance would be

image

When the data follow a t-distribution with  νdegrees of freedom, we have

image

Figure 3 and Figure 4 provide a simulation study for comparing two estimators  ˆρg andˆρc. We assume t-distribution with  νdegrees of freedom, where  ν ∈ {2.5, 3, 4, 4.5, 5, 6, 8, 10}as well  ν = ∞(i.e., normal distribution). In each panel, we plot the empirical mean square errors (MSEs): MSE(ˆρg) and MSE(ˆρc)(computed from 10000 repetitions), along with the (asymptotic) theoretical variance of  ˆρg: 1n2 (1 − ρ)�1 +�(1 − ρ)/2�4 VH4 ET 2E2T . For clarity, we did not plot the theoretical variance of  ˆρc, which is fairly simple and more straightforward to be verified.

The results in Figure 3 and Figure 4 confirm that  ˆρg, the estimator based on GMM, is substantially more accurate than  ˆρc, the commonly used estimator based on cosine. Roughly speaking, when  ν < 8, ˆρgis more preferable. Even when the data are perfectly Gaussian (the bottom row in Figure 4), the use of  ˆρgdoes not result in much loss of accuracy compared to  ˆρc.

image

Figure 3: Simulations for comparing two estimators of data similarity  ρ: 1) ˆρg, the estimator based on GMM, and 2)  ˆρc, the estimator based on cosine. We assume the data follow a t distribution with νdegrees of freedom. In each panel (for each  ν), we plot the empirical MSE(ˆρg) and MSE(ˆρc) aswell as the theoretical asymptotic variance of  ˆρg: 1n2 (1 − ρ)�1 +�(1 − ρ)/2�4 VH4 ET 2E2T . It is clear from the results that  ˆρgis substantially more accurate than  ˆρc. The theoretical asymptotic variance formula, despite the complexity of its expression, is accurate when  νis not too close to 2.

image

Figure 4: Continued from Figure 3. We present results for larger  ν(5, 6, 8, 10) and  ν = ∞ (i.e.,Gaussian data, the bottom row). Roughly speaking, when  ν < 8, it is preferable to use  ˆρg, theestimator based on GMM. In fact, even when data are perfectly Gaussian, using  ˆρgdoes not result in too much loss of accuracy.

The “cosine” similarity commonly used in practice essentially assumes that the data are normally (or equivalently) distributed. The data in reality, however, are typically heavy-tailed and sparse. A concurrent line of work [11, 12] has shown that the new measure named “generalized min-max” (GMM) is particularly effective as a positive definite kernel and there is an efficient computational procedure to convert this nonlinear kernel into linear kernel. Extensive experiments on more than 50 datasets [11, 12] have demonstrated the promising performance in machine learning tasks. This motivates us to develop the theoretical results for analyzing GMM.

We show that, under mild conditions, GMM converges to a limit as long as the data have bounded first moment. In contrast, the cosine similarity requires that data to have bounded second moment. We derive the explicit expression for the limit and establish the asymptotic normality of GMM with explicit (and sophisticated) variance expressions. Those theoretical results will be useful for further analyzing of GMM in statistics, machine learning, and other applications.

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, Hoboken, New Jersey, third edition, 2003.

[2] N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation. Cambridge University Press, 1987. Cambridge Books Online.

[3] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[4] M. E. Crovella and A. Bestavros. Self-similarity in world wide web traffic: Evidence and possible causes. IEEE/ACM Trans. Networking, 5(6):835–846, 1997.

[5] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. In SIGMOD, pages 251–262, Cambridge,MA, 1999.

[6] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sum of Independent Random Variables. Addison Wesley, Reading, MA, 1954.

[7] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.

[8] J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In FOCS, pages 14–23, New York, 1999.

[9] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2(1):1–15, 1994.

[10] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.

[11] P. Li. Generalized min-max kernel and generalized consistent weighted sampling. Technical report, arXiv:1605.05721, 2016.

[12] P. Li. Nystrom method for approximating the gmm kernel. Technical report, arXiv:1605.05721, 2016.

[13] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

[14] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):232–351, 2005.

For a random vector (X, Y ), we are interested in quantities

image

Without any assumption, we have

image

If (X, Y ) is symmetric in the sense of  (X, Y ) ∼ (−X, −Y ), then

image

and

image

The vector (X, Y ) has an elliptical distribution if

image

where  A = (a1, a2)T is a deterministic  2 × 2 matrix, Uis a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. In this case,  U ∼ −U, so that (X, Y )is symmetric. If T has a finite expectation, then T can be can cancelled in the calculation of  µ1 and

image

and

image

Since a bivariate Gaussian distribution is elliptical with  T 2 ∼ χ22, the elliptical case with finite ET is equivalent to the bivariate Gaussian case

image

image

and

image

We note that sin(2α) = 2 sin α cos α = 2�1/2 − ρ/2�1/2 + ρ/2 = �1 − ρ2, cos(2α) = ρ, andcos(2α − π/2) = sin(2α) =�1 − ρ2. Moreover, tan  τ = {σ − cos(2α)}/ sin(2α) = (σ − ρ)/�1 − ρ2,so that 1/ cos2 τ = 1 + tan2 τ = 1 + (σ − ρ)2/(1 − ρ2) = (1 − ρ2 + σ2 − 2σρ + ρ2)/(1 − ρ2) =(1 +  σ2 − 2σρ)/(1 − ρ2). Thus,

image

Consider the Gaussian case

image

Let

image

We have

image

As  α ∈ (0, π/2), it follows that

image

As cos(θ − 2α)/ cos θ = cos(2α) + tan θ sin(2α), for cos θ > 0 cos(θ − 2α)/ cos θ = σ iff θ = τ andτ ∈ [−π/2 + 2α, π/2]. Thus, with  t = 2α − θ,

image

image

We note that  τ = α when σ = 1. Similarly,

image

It is well known [6, 2] that

image

if and only if

image

This can be seen as follows. Write

image

We have

image

After applying this argument to  (Xi)− ∧ (Yi)−, (Xi)+ ∨ (Yi)+ and (Xi)− ∨ (Yi)−, the conclusion follows from

image

Now consider the bivariate t-distribution as an example:

image

where  χ2ν is independent of  N(0, Σ). Since N(0, Σ)can be written as  AU�χ22, the bivariate t- distribution can be written as

image

with two independent chi-square variables, where  F2,νdenotes the F distribution. It can be shown that

image

For example,  ET = π/√2 for ν = 2. For ν = 1, we still have (25), as (24) follows from

image

Let T be independent of  (ξ, ζ) and (Ti, ξi, ζi)be iid copies of  (T, ξ, ζ). Assume that  ET 2 + E(ξEζ −ζEξ)2 < ∞. Then,

image

with

image

Alternatively, if the condition  ET 2 < ∞is replaced by

image

Suppose (X, Y ) is elliptical and

image

As in the computation of  f∞, we have

image

image

and

image

Moreover

image

and

image

image

Consequently,

image

= 14π3�2τ + π − 4α + sin(2τ − 4α) + σ2 (π − 2τ − sin(2τ))�{σ(1 + sin τ) + 1 + sin(2α − τ)}2+ 14π3�σ2 (2τ + sin(2τ) + π) + (π + 4α − 2τ − sin(2τ − 4α)) + 4σ (sin 2α − 2α cos 2α)�

image

The expression can be simplified when  σ = 1:

image

Thus, when  σ = 1, we have

image

For a bivariate t-distribution with  νdegrees of freedom, we have  T ∼�χ22ν/χ2ν ∼�2F2,ν, and

image

Thus, when  ET 2 < ∞, we have the asymptotic normality

image

For t-distribution with  ν = 2, condition (25) holds as

image

[1] provides the result for the normal case. We extend the results of [1] to the general elliptical family. Again, a vector (X, Y ) has an elliptical distribution if

image

where  A = (a1, a2)T is a deterministic  2 × 2 matrix, Uis a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. We want to compute the asymptotic variance of the sample correlation

image

Due to scale invariance, it suffices to consider the case of  EX2 = EY 2 = 1.

image

Thus, the asymptotic variance of  �ρ is

image

Let  E0be the expectation in the Gaussian case. We have  T 2 ∼ χ22 under E0, E0T 2 = 2, E0T 4 =Var0(T 2)+(ET 20 )2 = 4+4 = 8, and V0 = (1−ρ2)2. A comparison with the solution in the Gaussian case yields

image


Designed for Accessibility and to further Open Science