The mean shift algorithm is a simple iterative algorithm introduced in [8], which became over years a cornerstone of data clustering. Technically, its purpose is to find local maxima of a function having the following shape:
where is a set of data points belonging to a vector space
is a scale factor, k(.) is a convex and decreasing function from
denotes the Euclidean norm in
optimization is done by means of a sequence
, which we will explicite in Sect. 2. (To alleviate notations, the factor
will be always omitted in this paper). Usually, functions such as f(.) of Eq. (1) are built to approximate an underlying probability density function (p.d.f.) from which the vectors
are assumed to be sampled. Following the terminology used in the kernel density estimation theory,
is the kernel while k(.) is refered to as the profile of this kernel.
As a mode seeking procedure, Mean Shift can be used for clustering: high density areas surrounding the modes of can be regarded as clusters of the point cloud
The advantage of this approach, and more generally of density-based methods lies in the reduced tuning work they require compared to other classical clustering algorithms. In partitioning-based clustering for example, to which belong the very popular K-means algorithm and its variations [15, 6, 2, 16], the number of clusters has to be set by the user. Hierarchical methods (such as OPTICS [1] or CURE [10]) return a dendrogram conveying the structure of the data set, not directly turnkey clusters. Model-based methods need even more input from the user as they assume the data follow some specific model characterized by a set of parameters to be estimated. Some well known examples of such approaches are Expectation Maximization (EM) [5] or Self-Organizing Map [13]. To come back to density-based methods, they do resort to prior knowledge on the structure of data through a choice of kernel metric. In particular, setting a proper scaling requires having an idea of the order of magnitude of the distance between points inside a cluster. The other way to introduce human expertise into Eq. (1) is using a meaningful distance with respect to the nature of the data.
If we can notice some works that extend Mean Shift to manifold space using kernels, as in [19] or [3], this point is usually not addressed directly in the literature, as the distance used has to be Euclidean (possibly up to a variable change). However, the reformulation of Mean Shift as a bound optimization, proposed in [7], suggests an interesting generalization to any (possibly non-Euclidean) distance. Note that convergence Mean Shift for a general kernel, even based on the Euclidean distance, is still an open problem as the proof given in [4] has been contested in [14] and in [9]. The closest result to the one we will show is, [11] which focuses on the more specific case of Epanechnikov mean shift. In the present paper, we clarify what is meant by “convergence" claimed by the different articles and what results have been proved, and situate our contribution in this general picture. Our contribution is threefold:
1. We make a synthetic review of mean shift theory.
2. We prove a new convergence result for the case of triangular kernel profiles, an immediate application of which is showing convergence of the Median Shift algorithm proposed in [18].
3. We propose a novel convergent mode seeking algorithm we call Wasserstein Median Shift, adapted to clustering of histogram representing distributions and which was the initial motivation for the present work.
The remaining of the article is organized as follows. In Section 2 we review the theory of mean shift, with a small focus on the generalization suggested in [7]. In Section 3, the core of our contribution, we study the specific case of the triangular profile and prove convergence of the returned sequence , apply it to Median Shift and introduce the Wasserstein Median Shift algorithm. Finally, in Section 4 we illustrate the theoretical results of the article on a synthetic and a real dataset.
This section summarizes the main results of the theory of mean shift. In 2.1, the equations of the algorithm are given along with their different interpretations. In 2.2, the equations of a generalized mean shift suggested in [7], applicable to non-euclidean distances, are derived. Finally, the main convergence results of the mean shift theory are listed in 2.3.
2.1 The mean shift algorithm
As explained in the introduction, the mean shift algorithm’s purpose is to find local maxima of the function defined by Eq.(1). To do so, given a point x of the data space, the algorithm builds a sequence starting at
that converges to a local maximum of the density function. This is achieved iterating the following step:
with where
is the opposite of the derivative of
. This process, the convergence of which to a mode of f(.) will be discussed in Section 2.3, can be given three interpretations: as a weight average, as a gradient ascend and as a bound optimization.
Mean shift as a weighted average Each mode estimate is a weighted average of the elements of the data set, with
the weight assigned to the element
. The profile k(.) being decreasing and convex, the function
appearing in the definition of the weights
is positive and decreasing: the further a point is from the current average, the lower its weight will be in the computation of the next average. Although insightful, this interpretation does not really help understanding why this process of local averaging would converge to a mode of the function defined by Eq. (1).
Mean shift as a gradient ascend Another interpretation is based on the remark that Eq. (2) can be re-written as:
with is the gradient of f(.) (checking this result is a direct computation). From this point of view mean shift is a gradient ascend with a step size
updated at each iteration.
Mean shift as a bound optimization The third interpretation, proposed in [7], assimilates mean shift to a bound optimization algorithm. The term bound optimization denotes a wide range of algorithms consisting in replacing a function to be optimized with an approximation
, having properties ensuring that optimizing
will increase the value of
[12]. Once the optimum of
has been found, it is used to build a new approximation
and the operation is iterated until convergence of the sequence. Seeing mean shift from this viewpoint, we can write Eq. (2) as:
where the function is defined as
It can be easily checked that zeroing the gradient of ends in Eq. (2), i.e., the classical formulation of mean shift. This point of view allowed the authors of [7] to revisit the main result of the mean shift theory (growth of the sequence
) in an especially insightful way. As noticed in this reference, the approach can be generalized to build a mean shift variant based on any (possibly non-Euclidean) distance. To our best knowledge the authors never wrote down nor used the algorithms obtained that way but their derivation, to which Sect. 2.2 below is dedicated, is a necessary step to the new results we will prove in Sect. 3.
2.2 Mean shift based on a general distance
Bibliographic remark 1. The ideas of this section are mentioned in the Discussion and Conclusions section of [7]. What follows was (to our best knowledge) never written in black and white, and their suggestion is not clear wether it is the Euclidean norm or the squared Euclidean norm that could easily be changed. However, what follows should be considered in all honesty a part of existing theory and a starting point for the results of Section 3, not a contribution of the present paper.
A benefit of the reformulation of mean shift as a bound optimization appears if we replace the squared Euclidean norm in Eq. (1) with a general function on the state space
when the norm of its first argument goes to infinity. The counterpart of Eq. (1) is then a p.d.f. of the form (omitting the factor
There is a priori no obvious transposition of Eq. (2) in this case. Conversely, adapting the bound optimization equations (3), (4) is straightforward and leads to the following iterative process:
with defined as:
Note that taking the minimum of in (6) (instead of the maximum of
) and removing the constant terms in (7), does not change the position of the optimum and leads to a simpler formulation of the same algorithm:
This generalization will be leveraged in Section 3.3 to build new versions of mean shift, better suited in particular to clustering histograms representing distributions and having convergence guarantees. For now, we can notice that classical properties of bound optimization hold, in particular Proposition 1 below.
Proposition 1. The sequence , is growing. Moreover, as it is also upper bounded, it is convergent.
Proof. Using Lemma 1 (see below) we have:
The proof of Proposition 1 used the following lemma:
Lemma 1. The functions verify the following properties:
(iii)
Proof. (i) is immediate using definition 7. (ii) is a straightforward consequence of (6). (iii) can be obtained as follows. The function k(.) being convex we have any
. Replacing
in the latter expression then summing over i we get for any x:
The last expression being exactly the function defined by Eq. (4) we obtain (iii).
Note that Prop. 1 does not imply convergence of the mode estimates , which has to be studied on a case-by-case basis. The existing convergence results for main variants of the Mean Shift algorithm will be summed up in Sect. 2.3.
2.3 Convergence of mean shift algorithms
In the previous subsection we generalized the main convergence result of the theory of mean shift (Prop. 1), verified by all variants of the method. It concerns , not the mode estimate
itself. In this subsection, we give a quick overview of the results obtained in the litteratture on the convergence properties of mean shift algorithms. We propose to classify these methods depending on their convergence guarantees. Three guarantee levels can usually be shown, from the lower to the higher:
1. The sequence is non-decreasing
2. The sequence converges to a stationary point of f
3. The sequence converges to a local maximum of f
We have of course is the minimal result of a mean-shift-based algorithm. 2 ensures we can use the algorithm as a clustering method: once we know the process is convergent we can define a cluster as a set of initializations x bringing the sequence
to the same limit value. 3 has been shown for Euclidean norm and Gaussian kernel in [9] and for Euclidean norm and Epanchenikov kernel in [11]. The current state of the theory of mean shift can be summed up by the following table, where the numbers 1, 2, 3 represent what result have been shown for each situation. Our contribution is to prove 2 for flat g(.) for any distance (circled 2 in the table).
This section contains the theoretical contributions of the present paper. The main result, established in Section 3.1, is the convergence of the generalized mean shift of Sect. 2.2 when the weighting function g(.) is flat (i.e. the profile k(.) is triangular). Actually, the sequence will even be shown to be stationary. In Sect. 3.2 and Sect. 3.3 two clustering algorithms are derived from this result: Median Shift (already proposed in [18] and now given a convergence proof) and Wasserstein median shift. The latter, which is the initial motivation for the work presented in this paper, is particularly well suited to histogram clustering.
3.1 Main result
In the present section, we derive the generalized mean shift equations in the case where the profile k(.) is triangular, and prove they generate a convergent sequence. We have here:
Applying formula (8) we obtain the following iteration:
where is a choice function that returns always the same x for a given
, and
. For a given n, we call
the active set. The only benefit in introducing
instead of writing
the same set will give the same
in case several minimizers exist. Now we can come to the main result of this section:
Theorem 1. The sequence defined by see Eq. (10) (i.e. generalized mean shift in the case of flat weights), is stationnary regardless of the distance used. It converges in a finite number of step.
The rest of this section will be dedicated to proving this result. Let us start with “candidates”:
Definition 1 (Candidates). We define as “candidates” all points x verifying for J a subset of indices. As there is a finite number of subsets J in the dataset, there is a finite number of candidates.
Obviously, each is a candidate, which means
evolves inside a finite set, as well as
its image through f. Using Prop. 1 we also know that
is growing, which implies:
Proposition 2. The sequence is stationnary.
Prop. 2 does not imply that is also stationary, as it could endlessly shift betwen different values having the same image through f(.). We need the following technical proposition:
Proposition 3. At a step , if a point is added to the active set, then
strictly grows even if some other points are removed. More formally, if (at least) one data point index
then we have
Proof. The set can be cut as
or as
. Splitting the sum
over these two decompositions of the set
we obtain the equality:
Let us study each term. We have the following results:
• Proof: by definition of
• Proof: by definition of
• Proof: By definition,
over all possible values of x, so that we have in particular
• , with zero obtaind only if
is empty. Proof: by definition of
we have
In the case where is empty we have
An immediate consequence of Proposition 3 is that if is non-increasing after a given step
then no data point can be added to
for
, i.e. we have
for
. In particular, the cardinal of
is non-increasing after rank
. But this cardinal is a positive integer, so it is necessarily stationary after a rank
. For
no point can be added to nor removed from
and this set becomes stationary, in a finite number of steps. Remembering Eq. (10) we conclude that
is stationary and prooves of Theorem 1.
The application which motivated the present work is clustering of histograms, for which the Wasserstein metric defined in Section 3.3 by Equation (13) is much more meaningful than the Euclidean distance. We will see that working with the Wasserstein metric can be reduced to working with the -norm, so we first analyze this case in Section 3.2.
3.2 Application 1: Median shift
In this subsection, we present median shift, a L1 variant of mean shift. The algorithm was already presented in [18]. The authors did notice a lot of advantages of median shift upon mean shift and make interesting experiments, however, they do not demonstrate its convergence.
Using -norm) and the triangular kernel (defined by
u, 0)), the p.d.f. of Eq. (5) becomes:
where denotes the
is the set of indices i for which we have
. Eq. (8) defining the generalized mean shift iteration becomes:
with , and Med(.) the term by term median of a set of points, as the median minimizes the L1 norm. The general result of Sect. 3.1 applies and ensures convergence of this “median shift” algorithm
Proposition 4. The sequence of mode estimates returned by Eq. 11 is convergent (more specifically it is stationnary) for any initialization, as a direct application of theorem 1.
3.3 Application 2: Wasserstein median shift
The Wasserstein metric, also called the Earth Mover’s Distance, is a distance function between probability distributions on a set U, linked to optimal transport theory. We focus here on the specific case where U is a finite set of cardinal q, and the probability distributions are approximated by normalized histograms. We do not enter here in the generalized form of Wasserstein metric, as it is not the main point of the paper. The intuition is that this distance it takes into account the order of the histogram’s bins, as the difference of levels between close bins as less impact than the difference on far bins.
Under these hypotheses, we can define the Wasserstein distance between two histograms M and is the set of normalized histograms:
1}, but it can be better defined in the space of cumulated histograms:
Let us introduce two operators linking histograms to cumulated histograms:
with the convention . We have of course:
The Wasserstein metric can then be shown to take the simpler form below:
A natural way of obtaining a minimizer for Eq. 8 is to compute it in the space of cumulated histograms, then come back to the space of normalized histograms using the function diff. All we need to check is that the term-by-term median of a set of cumulated histograms is still a cumulated histogram.
Proposition 5. The term-by-term median of a set of cumulated histograms is still a cumulated histogram.
Proof. Let be a set of cumulated histograms and
its term-by-term median defined by
(coordinates are denoted with a superscript). For
to be a cumulated histogram we have to check
and that
is growing. The two first properties are obvious ( the median of numbers all equals to one is one, the median of positive numbers is positive). To show the third required property we notice that the “median” function is growing w.r.t. each of its arguments. Thus, having
for each i implies Med
. This is all we needed to establish that the median of a set of cumulated histograms is still a cumulated histogram.
Proposition 6. If the distance function in the equation (8) is the Wasserstein metric
defined by Equation (13), then the optimization step in (8) is solved by:
An equivalent, but more practical formulation of the algorithm is the following:
where . Proposition 6 provides a closed-form formula for the optimization problem appearing in Eq. (8) when used with the Wasserstein metric. It makes the adaptation of Mean Shift tractable in practice, but also suggests directly working on the cumulative histograms, which gives Algorithm 1 (Wasserstein median shift). Following Proposition 4, the obtained estimate
is again stationary.
In this section, we present results on two datasets, one on synthetic data, with a pure illustrative purpose, and one on real aeronautical data. We compare our wasserstein median shift algorithm
Figure 1: Two samples of the two classes of histograms generated for the experiment described in Section 4. Left: class 1 ; right : class 2. We see their difference is obvious. Yet, the classical mean shift based on the -norm performs very poorly on this data set, contrary to the Wasserstein median shift.
with classical mean shift and other classical algorithms of clustering. We do not claim that our algorithm works better in all cases, (a property which cannot be true for any clustering algorithm), but we show here some cases where it can outperfom classical methods. The correlation between the obtained results and the ground truth is assessed using the Adjusted Rand Index (ARI [17]). An ARI of 1 means that two clusterings are identical, an ARI close to 0 (or negative) expresses a total dissimilarity between the two clusterings.
4.1 Datasets
To illustrate the flaws of the distance for histogram clustering, we build a data set consisting of empirical histograms, each computed using 100 samples of a random variable. The law of the random variable is different for each histogram, but belongs to one of the two following classes.
Class 1: Gaussian variables having similar means and the same variance.
Class 2: Mixtures of two Gaussian variables, one of them chosen as in class 1.
For Class 1, the Gaussian have their means ranging from 0.47 to 0.53. For Class 2, the two components of the mixture have weights of 0.8 and 0.2 respectively. Their means range from 0.47 to 0.53 and from 0.17 to 0.23 respectively. The standard deviation is 0.02 for all Gaussian. For example, samples of these two classes of histograms are displayed on Figure 1, and, looking at it, retrieving the two classes is a priori an easy task.
The second situation we considered is the application framework which motivated the research presented in the paper. A fleet of helicopters is used by a company for several kinds of missions, and we look for a clustering algorithm able to retrieve the various use cases from in-flight recorded data. Instead of using long and complexe descriptors on temporal data, we used histograms, easier to compute on large sets of data, and more understandable. The dataset we used, provided by an helicopter company, contains 122 time series representing the evolution, during a mission, of the altitude of a helicopter. They cannot be directly compared as vectors because their lengths are different. Note that the general issue of comparing time series having different sizes is difficult in itself, and several responses such as Dynamic Time Warping have been brought in the past. The solution we chose here is comparing the series through their histograms.
Figure 2: True clusters of the dataset, and outliers in the last square.
4.2 Experiments
We tested 8 techniques from scikit-learn: Mini Batch K-means (MBKM), Affinity Propagation (AP), Spectral Clustering (SC), Agglomerative Clustering with Ward (ACW), Agglomerative Clustering with average linkage (ACA), DBSCAN (DB), Birch (Bir), and Gaussian Mixture (GM) and compare to Wasserstein Median Shift (ours, WMS) and Mean Shift (MeaS). Some of them being stochastic, we launched them 100 times and gave the standard deviation of the ARI. For all methods, when a parameter should be given, we gave to the algorithm reasonnable value (for example, we gave exact number of clusters for synthetic data). The results are given in the Table 4.2. For both datasets, Wasserstein Median Shift outperforms all other clustering algorithms. We can also see that some algorithms perform better on real than synthetic data. This is due to our synthetic dataset being specifically designed to show the L2-norm can be tricked with mere Gaussian laws. The only exception is DBSCAN, which works for synthetic data and could not work on real ones (we tried with several value, but none worked). This result could be explain by the fact that in our real data there are samples linking clusters, making difficult for DBSCAN to separate them.
Table 1: Results on real and synthetic dataset.
We also went further and wrote a “Wasserstein version” of the algorithms for which it could be easily done: K-means-Wasserstein (KMWS) and DBSCAN-Wasserstein (DBSCAN WS). All results are presented in the table 4.2 We can see that the Wasserstein versions of the algorithms work better on synthetic and real data than their counterparts. Nevertheless, modified algorithms are still outperformed by Wasserstein Median Shift, which hypothesis fit better to the dataset. Our understanding of the difficulties encountered with
-norm for histogram clustering is that the L2-norm is invariant by histogram bins permutations while the order of the bins is of great significance for histograms.
Table 2: Results with only Wasserstein algorithms.
In this work, after a quick review of the results on mean shift theory, we introduced a new class of algorithm with convergence properties. We proposed two applications : the median shift algorithm introduced in [18] to which we give a convergence proof, and a novel algorithm : the Wasserstein median shift. We also provide some experiments to show the usefullness of the latter. In further works, we intend to look at extensions of the algorithms, to obtain clustering with different bandwith regarding the samples, and the possibility to extend to other profile than the triangular profile.
[1] Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: ordering points to identify the clustering structure. In ACM Sigmod Record, volume 28, pages 49–60. ACM, 1999.
[2] Geoffrey H Ball and David J Hall. Isodata, a novel method of data analysis and pattern classification. Technical report, DTIC Document, 1965.
[3] Hasan Ertan Cetingul and René Vidal. Intrinsic mean shift for clustering on stiefel and grassmann manifolds. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1896–1902. IEEE, 2009.
[4] Dorin Comaniciu and Peter MEER. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, pages 603–619, 2002.
[5] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
[6] Edwin Diday. The dynamic clusters method in nonhierarchical clustering. International Journal of Computer and Information Sciences, pages 61–68, 1973.
[7] Mark Fashing and Carlo Tomasi. Mean shift is a bound optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):471–474, 2005.
[8] Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory, 21(1):32– 40, 1975.
[9] Youness Aliyari Ghassabeh. A sufficient condition for the convergence of the mean shift algorithm with gaussian kernel. Journal of Multivariate Analysis, 135:1–10, 2015.
[10] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: an efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84. ACM, 1998.
[11] Kejun Huang, Xiao Fu, and Nicholas D Sidiropoulos. On convergence of epanechnikov mean shift. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] David R Hunter and Kenneth Lange. A tutorial on mm algorithms. The American Statistician, 58(1):30–37, 2004.
[13] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[14] Xiangru Li, Zhanyi Hu, and Fuchao Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756–1762, 2007.
[15] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, pages 139–137, 1982.
[16] Raymond T. Ng and Jiawei Han. Clarans: A method for clustering objects for spatial data mining. IEEE transactions on knowledge and data engineering, 14(5):1003–1016, 2002.
[17] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971.
[18] Lior Shapira, Shai Avidan, and Ariel Shamir. Mode-detection via median-shift. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1909–1916. IEEE, 2009.
[19] Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718. Springer, 2008.