In statistics and machine learning, it is often crucial to choose, either explicitly or implicitly, some measure of data similarity. The most commonly adopted measure might be the “cosine” similarity:
where x and y are n-dimensional data vectors. This measure implicitly assumes that the data have bounded second moment otherwise it will not converge to a fixed limit as the sample size increases. The data encountered in the real-world, however, are virtually always heavy-tailed [9, 4, 5]. [14] argued that the many natural datasets follow the power law with exponent (denote by between 1 and 2. For example,
for the frequency of use of words,
for the number of citations to papers,
for the number of hits on the web sites, etc. Basically,
that data have bounded second moment. The cosine similarity (1) will not converge (as
a fixed constant if the data do not have bounded second moment.
In this study, we analyze the “generalized min-max” (GMM) similarity. First, we define
Then we compute GMM as follows:
Note that for nonnative data, GMM becomes the original “min-max” kernel, which has been studied in the literature [8, 3, 13, 7, 10]. This paper focuses on analyzing theoretical properties of GMM. In particular, we are interested in the limit of and how fast
converges to the limit. The convergency and speed of convergence are important. For example, the cosine similarity (1) is popular largely because, as long as the data have bounded second moments, Cos(x, y) converges to a fixed limit which is believed to be a good characterization of the similarity between x and y.
To proceed with the analysis, we will have to make assumptions on the data. In this paper, we adopt the “elliptical distribution” model [1] which is very broad and includes many common distributions (such as Gaussian and Cauchy) as special cases. We first provide a simulation study.
The bivariate t-distribution has an explicit density and is a special case of the elliptical distribution. Denote by the bivariate t-distribution with covariance matrix
degrees of freedom. Basically, if two independent variables
, then we have
compute according to (2). We are interested in the mean an standard deviation of GMM for
, as shown in Figure 1.
The panels in the first (top) row present the mean of GMM (The curves of GMM lie between two fixed curves,
, which we will calculate to be the following expressions:
For better clarity, the panels in the second (middle) row plot the magnified portion. In each panel, the top (dashed and green if color is available) curve represent and the bottom (dashed and red) curve represent
. We can see that for
converges to
also converges to
but much slower. With
does not converge to
The panels in the third (bottom) row plot the standard deviation (std). For curves converge to 0, although at
the convergence is much slower. When
standard deviation does not converge to 0.
Basically, the simulations suggest that converges to
as long as the data have bounded first moment (i.e.,
) and the convergence still holds for the boundary case (i.e.,
will provide thorough theoretical analysis on
for the general elliptical distribution.
Because measures data similarity, the fact that
as long as
is important because it means we have a robust measure of
as long as the data are “reasonably” distributed. As shown by [14], most natural datasets have the equivalent
Figure 1: We simulate defined in (2) from the bivariate t-distribution with
0.5, 1, 2, 3 degrees of freedom, and n = 1, 10, 100, 1000, for 10000 repetitions. In the panels of the first two rows, we plot the mean curves together with two fixed (dashed) curves
in (3). The panels in the second row are the zoomed-in version of the panels in the first row. The bottom panels plot the empirical standard deviation of
We consider , are iid copy of (X, Y ). Our goal is to analyze the statistical behavior of GMM, especially as
To proceed with the theoretical analysis, we make a very general distributional assumption on the data. We say the vector (X, Y ) has an elliptical distribution if
where is a deterministic
is a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. See [1] for an introduction. In the family of elliptical distributions, there are two important special cases:
Note that in we consider
to allow the situation that two vectors have different scales. For the convenience of presenting our theoretical results, we summarize the notations:
• is the solution of cos(
cot(2
. Note that
In addition, we need the following definitions of , for general
as well as
Theorem 1 presents the results for consistency.
Theorem 1. (Consistency) Assume (X, Y ) has an elliptical distribution with and
, be iid copies of (X, Y ), and GMM(x, y) =
• If (X, Y ) has a t-distribution with degrees of freedom, then
almost surely if
in probability if
Theorem 2 presents the results for asymptotic normality.
Theorem 2. (Asymptotic Normality) With the same notation and definitions as in Theorem 1, the following statements hold:
) + 1 + sin(2
+ 14
Figure 2 presents a simulation study to verify the asymptotic normality, in particular, the asymptotic variance formula
by considering that the data follow a t-distribution with degrees of freedom and
The simulation results confirm the asymptotic variance formula at large enough sample size n. When n is not too large, the asymptotic variance formula (18) can be conservative.
Figure 2: Simulations for verifying the asymptotic variance formula (18) based on t-distribution with degrees of freedom where
spaced at 0.01. For each case, we repeat the simulation 10000 times. We report the empirical
with the theoretical asymptotic value
plotted as dashed curves. For n large enough, the asymptotic variance formula (18) becomes accurate. For small n values, however, the formula can be quite conservative.
The fact that also provides a robust and convenient way to estimate the similarity between data vectors. Here, for convenience we consider
. For this case, we have
. This suggests an estimator of
As . In other words, the estimator
is asymptotically unbiased. The asymptotic variance of
can be computed using “delta method”:
See (18) and Theorem 2 for more details. Again, we emphasize that this estimator is meaningful as long as
as long as
It is interesting to compare this estimator with the commonly used estimator based on the “cosine” similarity:
When the data are bivariate normal, it is a known result [1] that , when appropriately normalized, converges in distribution to a normal
This asymptotic normality (with difference in the variance term) holds as long as the data have bounded fourth moment. Here, we present the generalization as a theorem.
Theorem 3.
Based on Theorem 3, a natural estimator of and its asymptotic variance would be
When the data follow a t-distribution with degrees of freedom, we have
Figure 3 and Figure 4 provide a simulation study for comparing two estimators -distribution with
degrees of freedom, where
as well
(i.e., normal distribution). In each panel, we plot the empirical mean square errors (MSEs):
(computed from 10000 repetitions), along with the (asymptotic) theoretical variance of
. For clarity, we did not plot the theoretical variance of
, which is fairly simple and more straightforward to be verified.
The results in Figure 3 and Figure 4 confirm that , the estimator based on GMM, is substantially more accurate than
, the commonly used estimator based on cosine. Roughly speaking, when
is more preferable. Even when the data are perfectly Gaussian (the bottom row in Figure 4), the use of
does not result in much loss of accuracy compared to
Figure 3: Simulations for comparing two estimators of data similarity , the estimator based on GMM, and 2)
, the estimator based on cosine. We assume the data follow a t distribution with
degrees of freedom. In each panel (for each
), we plot the empirical MSE(
well as the theoretical asymptotic variance of
. It is clear from the results that
is substantially more accurate than
. The theoretical asymptotic variance formula, despite the complexity of its expression, is accurate when
is not too close to 2.
Figure 4: Continued from Figure 3. We present results for larger (5, 6, 8, 10) and
Gaussian data, the bottom row). Roughly speaking, when
, it is preferable to use
estimator based on GMM. In fact, even when data are perfectly Gaussian, using
does not result in too much loss of accuracy.
The “cosine” similarity commonly used in practice essentially assumes that the data are normally (or equivalently) distributed. The data in reality, however, are typically heavy-tailed and sparse. A concurrent line of work [11, 12] has shown that the new measure named “generalized min-max” (GMM) is particularly effective as a positive definite kernel and there is an efficient computational procedure to convert this nonlinear kernel into linear kernel. Extensive experiments on more than 50 datasets [11, 12] have demonstrated the promising performance in machine learning tasks. This motivates us to develop the theoretical results for analyzing GMM.
We show that, under mild conditions, GMM converges to a limit as long as the data have bounded first moment. In contrast, the cosine similarity requires that data to have bounded second moment. We derive the explicit expression for the limit and establish the asymptotic normality of GMM with explicit (and sophisticated) variance expressions. Those theoretical results will be useful for further analyzing of GMM in statistics, machine learning, and other applications.
[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, Hoboken, New Jersey, third edition, 2003.
[2] N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation. Cambridge University Press, 1987. Cambridge Books Online.
[3] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.
[4] M. E. Crovella and A. Bestavros. Self-similarity in world wide web traffic: Evidence and possible causes. IEEE/ACM Trans. Networking, 5(6):835–846, 1997.
[5] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. In SIGMOD, pages 251–262, Cambridge,MA, 1999.
[6] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sum of Independent Random Variables. Addison Wesley, Reading, MA, 1954.
[7] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.
[8] J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In FOCS, pages 14–23, New York, 1999.
[9] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2(1):1–15, 1994.
[10] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.
[11] P. Li. Generalized min-max kernel and generalized consistent weighted sampling. Technical report, arXiv:1605.05721, 2016.
[12] P. Li. Nystrom method for approximating the gmm kernel. Technical report, arXiv:1605.05721, 2016.
[13] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[14] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):232–351, 2005.
For a random vector (X, Y ), we are interested in quantities
Without any assumption, we have
If (X, Y ) is symmetric in the sense of
and
The vector (X, Y ) has an elliptical distribution if
where is a deterministic
is a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. In this case,
is symmetric. If T has a finite expectation, then T can be can cancelled in the calculation of
and
Since a bivariate Gaussian distribution is elliptical with , the elliptical case with finite ET is equivalent to the bivariate Gaussian case
and
We note that sin(2cos(2
. Moreover, tan
so that 1
(1 +
Consider the Gaussian case
Let
We have
As , it follows that
As cos(. Thus, with
We note that . Similarly,
if and only if
This can be seen as follows. Write
We have
After applying this argument to , the conclusion follows from
Now consider the bivariate t-distribution as an example:
where is independent of
can be written as
, the bivariate t- distribution can be written as
with two independent chi-square variables, where denotes the F distribution. It can be shown that
For example, , we still have (25), as (24) follows from
Let T be independent of be iid copies of
. Assume that
with
Alternatively, if the condition is replaced by
Suppose (X, Y ) is elliptical and
As in the computation of
and
Moreover
and
Consequently,
) + 1 + sin(2
+ 14
The expression can be simplified when
Thus, when
For a bivariate t-distribution with degrees of freedom, we have
Thus, when , we have the asymptotic normality
For t-distribution with , condition (25) holds as
[1] provides the result for the normal case. We extend the results of [1] to the general elliptical family. Again, a vector (X, Y ) has an elliptical distribution if
where is a deterministic
is a vector uniformly distribution in the unit circle and T is a positive random variable independent of U. We want to compute the asymptotic variance of the sample correlation
Due to scale invariance, it suffices to consider the case of
Thus, the asymptotic variance of
Let be the expectation in the Gaussian case. We have
Var
. A comparison with the solution in the Gaussian case yields