Clustering by Deep Nearest Neighbor Descent (D-NND): A Density-based Parameter-Insensitive Clustering Method

2015·arXiv

Abstract

2 Motivation and Idea of this paper

Despite that DG, ND and H-NND could largely reduce the risk of the under-smoothed density or potential to the ultimate clustering results, while the risk of the over-smoothed case to the clustering result still exists. In other words, these methods could still be affected by the kernel bandwidth, despite the “safe range” for the bandwidth has become relatively larger. Theoretically, one can adjust the bandwidth, but judging whether the current bandwidth leads to a better clustering result is nontrivial. Besides, there are cases, e.g., multiple scales (or resolution) in different clusters, or unbalanced element numbers in different clusters, that one fixed kernel bandwidth cannot well reveal the underlying density. That is why we previously proposed a visualization method, called IT-map (27), to help set a more reliable bandwidth or different bandwidths for different clusters by using the divide-and-conquer strategy under the supervision of users. Besides, one can expect a kernel density estimation method that could make the bandwidth adaptive to the local distributions of data points.

Nevertheless, is it possible that the bandwidth problem (both the associated over-smoothed and under-smoothed risks for the density estimation) could be solved all by the density-based clustering method itself rather than the help of IT-map or adaptive kernel density estimation methods? In order to fulfill this goal, we should first answer such question: what is the cause for the bandwidth problem? In our perspective, the answer lies in the fact that the two processes involved, i.e., density estimation and clustering, are separate and performed in a serial way (as the questions in Section 1.1 reveals) for those density-based clustering methods DG, ND and H-NND (the latter two can also be viewed as density-based methods), and consequently the clustering process largely relies on the performance of the density estimation.

Therefore, in this paper, we will propose a density-based hierarchical clustering method, called the Deep Nearest Neighbor Descent (D-NND), in which the above two processes will interplay. This is fulfilled by making full use of the hierarchical strategy and the merit of NND. Specifically, in each layer of the hierarchy, the density (actually we still use the potential form in D-NND) on certain nodes are updated based on the discovered cluster structure, and in turn, the updated density estimation will be used to renew the cluster structure. The density in each layer is estimated based on the local information, whereas as the layer number increases, the magnitudes of the estimated potentials on the sample nodes in the higher layers could gradually grasp the global density distribution in the dataset. By this way, the proposed method could be adaptive to the multi-resolutions of the different clusters.

D-NND is expected to not only largely reduce the negative effect made by the un-der-smoothed potential estimation, a property inherited from ND and H-NND, but also largely reduce the risk of over-smoothed case. In effect, the proposed method could appear insensitive to the parameters in considerably large ranges of values.

3 Method

3.1 The details of the proposed method: D-NND

As summarized in Table 1, D-NND takes as input the distance between any pair of data points , {1, , }. At the beginning, all points are of zero potential (Note that in order to use NND, we use the term potential rather than density , whereas = −), and , where X and Y respectively denote the input and output dataset in each layer.

Each layer contains 5 steps. First, the neighbor nodes (denoted as ) of each node are identified by constructing the Neighborhood Graph such as k-Nearest-Neighbor graph7 in which each node selects the k nearest nodes as its neighbors. Then, the local potential on each node is computed via summing the dissimilarities between point i and all its neighbors plus the history potential of node i in the last layer. Then comes NND method, which is divided into two steps 3 and 4. In Step 3, the candidate parent node set of each node is defined, and the nodes (corresponding to the locally extreme points) with null is denoted as dataset Y . In steps 4, the parent node of each node is identified. The root nodes in the dataset Y will serve as the input for the next layer. The last layer occurs in the time when there is only one root node (denoted as node r) in Y .

In conclusion, given the input nodes in X, each layer functions to: (i) update the potentials of the nodes in X ; (ii) identify the root nodes Y ; (iii) identify the parent nodes for the remaining nodes in .

As illustrated in Fig. 8A, if we connect each node to its parent node and view the points in Y as the root nodes (the red points in Fig. 8A), in effect, data points are nested layer by layer (Fig. 8A, from left to right) and the number of root nodes is reduced layer by layer, until all the data points are organized into the fully connected IT data structure with only one root node left (Fig. 8A layer 3). For the IT data structure, the data points in the original input {1, , }correspond to the nodes in it. Each node i (except node r) and its parent node respectively defines the start and end nodes of one directed edge () and the distance , between them defines the edge length . Note that, each directed edge () in IT is the only directed edge, denoted as , started from node i. In other words, the start node of each edge in IT can serve as the unique identifier of each edge. This is the basis for the Decision-Graph-Cut in the Top-down stage.

Table 1. The bottom-up stage of D-NND

Procedure:

1. Identify the neighbor nodes of each node

2. Estimate the potential on each node

Top-down stage (removing the redundant edges):

In order to divide the dataset into groups (i.e., clustering), the fully connected IT requires to be divided into pieces (each representing a cluster) by removing the redundant edges in IT.

Pleasingly, like the IT structures constructed by ND (Fig. 5C) and H-NND (Fig. 7), the IT generated here (Fig. 8A, layer 3) also shares the good property that the redundant edges (cyan) are distinguishable from other edges, and thus they are not hard to be determined, as Fig. 8B shows, the redundant edges can be easily determined by two different methods: E-cut and Decision-Graph-Cut. E-cut is the plot of the lengths of all edges in IT in decreasing order (Fig. 8B, left). Decision-Graph-Cut (see also section 1.5) features each directed edge in IT by two associated variables, its edge length and the potential of its identifier (i.e., its start node), as the right image in Fig. 8B shows, the pop-out points (interactively determined in the red box) could rightly correspond to the redundant edges in IT. As Fig. 8C shows, the above two methods lead to the same and almost perfect clustering result.

In fact, besides E-cut and Decision-Graph-Cut, we have previously devised other effective methods (with different advantages) to remove the redundant edges in IT, e.g., IT-map (27), IT-Dendrogram (28), G-AP (29) and SS-cut (8).

For simplicity, we will still use the E-Cut and Decision-Graph-Cut only in the following experiments to demonstrate the power (effectiveness and insensitivity to the parameters) of the proposed method D-NND. The saliency of the redundant edges and the diversity of the edge removing methods could guarantee the good performances of clustering results.

Fig. 8 An illustration for D-NND on a dataset with N = 5000 data points. (A) From left to right, as the layer increases from 0 to 3, the number of the root nodes (red) was successively reduced from 5000, 418, 15 to 1, and at the same time, the whole data points were gradually organized into the fully connected graph, i.e., the IT (the rightmost image). For clustering purpose, the redundant edges (cyan) in IT require to be removed, which are, however, obviously distinguishable (with longer lengths at least) from the other edges and thus can be easily determined. (B) Different methods to determine the redundant edges in IT. Left (E-Cut): the plot of the lengths (in decreasing order) of all edges. The K longest edges that need to be removed can be easily determined according to the plot by setting a threshold between points with a large gap or just by counting the number of the points with saliently large values (in red). Right (Decision-Graph-Cut): the interactively determined pop-out points (blue) correspond to the start nodes of the redundant edges. These two methods lead to the same clustering result (C). The almost perfect clustering results demonstrate the effectiveness of the whole process. In this illustration, we set 100.

3.2 Parameters

The proposed method has at most 2 parameters, one is the neighbor number k, if the k-Nearest-Neighbor graph (k-NN) is used to define the neighborhood relationship in step 1; the other is a kernel-like parameter σ, if the exponential function is used to compute the dissimilarity in step 2. We will show in the experiments that, the proposed method is not sensitive to both in an extremely large range. In fact, the parameter could now function like the “fine-tune knot”, which is, however, a good thing from the technical point of view.

4 Experiments

We first tested three 2D datasets (Fig. 9) from (30), (7), (31), respectively, using large range of values for both . For the 1st dataset: k was varied from 5 to 50 while = 1 or 10000. For the 2nd dataset: k was also varied from 5 or 50 while = 1 to 10000. For the 3rd dataset: k was varied from 5 to 40 while = 1 to 10000; Note that, for the first dataset, the data in each dimension were normalized to [0 1] like what we did in Fig. 8. In all cases, there are points popping out in the Decision Graphs and the corresponding clustering results are all consistent with our visual perception as Fig. 9 shows. In fact, it can be seen that the same clustering performances can also be achieved by using the E-Cut, since the edge length variable alone (vertical axis in Decision Graphs in Fig. 9) is enough to distinguish those pop-out points (or the corresponding redundant edges). In order to further demonstrate this, we did such experiments on the first dataset that k varies from 2, 10 to 40 and varies from 0.1, 100 to 10000. In each case, the 14 longest edges in the IT structures were removed, which results in 15 clusters. By comparing the clustering assignments of all results with the benchmark data, the average error rate for the clustering assignments is almost negligible: 0.0057 0.0006 (mean standard deviation). In other words, all of the clustering results are almost perfect when both vary in large ranges of values.

Fig. 9. Tests on three 2D datasets. The parameters , especially σ, vary in large ranges of values for each dataset. Euclidean distance was used to compute the pair-wise distance.

Then, we tested a set of (five) high-dimensional datasets from (32), each containing N = 1024 data points (sampled from M = 16 Multivariate Gaussian functions), whereas the dimension d varies from 32, 64, 256, 512, to 1024. For each dataset, we chose two largely different values for k (= 5 or 500), together with two extremely different values for (= 1 or 100000). We used E-Cut for all cases. The plots of the edge lengths are shown in Fig.10. In each plot, there is a salient gap between the determined number of largest edge lengths (in red) with the rest ones by setting an appropriate threshold in the gap. Almost all of the corresponding clustering results are perfect. The clustering error rate for all cases are 0 and the cluster numbers for most of them are consistent with the underlying number (M = 16) of groups, except few of them with slightly larger cluster numbers, being either 17 or 18. However, we found that the extra cluster(s) contains only one point, actually being regarded as the outlier.

Fig. 10. Tests on five high-dimensional datasets (from left to right, d = 32, 64, 256, 512, 1024). The parameters vary in large ranges of values for each dataset. Euclidean distance was used.

We also tested the United State Poster Service (USPS) digit number dataset, which contains N = 11000 grayscale handwrite digits. Each digit is a 16 × 16 image, treated as a 256-dimensional vector in the test. Here, we used the Decision-graph-Cut, expecting to reach small error assignment with small cluster number. We also selected five significantly different values for k (= 2, 5, 10, 20, or 50) and two extremely different values for (= 1 or 100000). The testing results for all cases are shown in Fig. 11.

Fig. 11. Tests on the USPS digit number dataset with large ranges of values for parameters k and σ, especially σ. Cosine distance was used.

5 Conclusions

In Section 1, we first summarized two general questions (i.e., the questions Q1 and Q2) of the density-based clustering and accordingly introduced seven density-based clustering methods, i.e., DENCLUE, Mean-Shift, Graph-GA, Rodriguez and Laio’ s DG, and the methods proposed by us before (i.e., ND, NND and H-NND), and at the same time we introduced (i) four problems (P1~P4) of DENCLUE or GA; (ii) how other methods partly solve those problems; (iii) the relationship between DG and ND: DG can be viewed as the Nearest Ascent (NA), the counterpart of ND; (iv) the relationship between the newly proposed novel methods (DG, ND and NND, since 2014) and previous methods (DENCLUE, Mean-Shift and Graph-GA): NND could serve as the tie for them. In Section 2, we explained the motivation (a density-based clustering method that can solve the density estimation problem all by itself) and the main idea of this paper (taking the two processes, density estimation and clustering, interplay). In Section 3, we introduced the proposed method D-NND in details. In Section 4, we showed in the experiments that D-NND has not only the strong capability of discovering the underlying cluster structure but also the remarkable reliability due to its insensitivity to parameters in large ranges of values.

Although NND can only avoid the first three problems of GA, via the proposed hierarchical learning framework (unsupervised), the last problem of GA is also largely solved by the proposed method D-NND. This endows D-NND with such advantages: non-iterative, unconstrained by the attributes of the data, and insensitive to kernel bandwidth, or in other words, the efficiency, general meaning, and reliability. Besides, D-NND also shares the seven general merits (in the beginning of Section 1) of most of the density-based clustering algorithms.

6 Discussions

6.1 Bottom-up exemplar election.

Like the affinity propagation (AP) (33), the bottom-up stage of the proposed D-NND can also be interpreted as the process of selecting exemplars (or representatives). In this prospective, Fig. 8A can be analyzed like this: at the beginning, all points are viewed as the exemplars of themselves. After comparing among the exemplars, several of them (local density points) will continue to be the exemplars of the other exemplars in the next layer. This process proceeds in the higher layers until one exemplar becomes the exemplar of all the remaining exemplars in certain layer. Thus, the whole bottom-up stage can be roughly viewed as a hierarchical exemplar election system. Since in each layer only the local information is considered, this could make the selection in each layer to be local optimum. And as the layer increases, the exemplars could actually represent large ranges of areas of points from a global perspective. In other words, the selection will gradually become global optimum in the higher layers.

6.2 “Center-biased” trend for the exemplars guarantees the saliency of the generated redundant edges.

Before being replaced (in the higher layers) by the exemplars in other clusters, the exemplars in the same clusters will gradually evolve to the centers of the clusters, forming a “center-biased” phenomenon (Fig. 8A), which guarantees the saliency of the generated redundant edges (Fig. 8A, layer 3), since each redundant edge will approximately start from the center of one cluster and end in the center of another cluster. This is in stark contrast to the case for the MST in which the redundant edges usually occur in the marginal areas of clusters and thus usually being hard to be determined. Note that, Fig. 8A shows the case that all undesired edges are generated simultaneously in the last layer, but there is also case that the redundant edges (cyan) are generated in different layers as Fig. 12 shows.

Fig. 12. Test on the unbalanced dataset8. k = 30, σ = 1;

6.3 Insensitive to parameters.

On the one hand, since density estimation in each layer is based on the neighborhood relationship, the kernel is constrained in a local sphere and thus, no matter how large the kernel-like parameter σ is, it makes very litter effect on the global distribution and thus the over-smoothing density phenomenon is largely avoided, provided that the neighborhood relationship is to some degree guaranteed (for the nonparametric neighborhood graph such as Delaunay Graph, this is always valid; and for k-NN graph,

k should not be too large). On the other hand, such local constraint for the density estimation would make the under-smoothed density problem (in the lower layer) even worse. For DENCLUE (or GA), Mean-Shift, Graph-GA, and NND, this would mean a severer over-partitioning result. However, as revealed by the experiments, this un-der-smoothed case is not so troublesome to D-NND either, due to the hierarchical procedure (which gradually improves this situation) and the strategy of redundancy design (which largely reduces the effect of the ripple noise). Overall, the prevention for the over-smoothed density estimation and the ability to tackle with the problem (ripple noise) brought by the under-smoothed case make the proposed method insensitive to the parameters (especially the kernel-parameter) in large ranges.

6.4 NND vs. GD: particularity vs. generality

For one thing, hierarchical strategy largely solves the over-partitioning problem of NND. For another, the role of NND is fully played and revealed, as a significant element in each layer. The reasons are analyzed as follows. Suppose that we use in each layer the Gradient Descent (GD) (i.e., the counterpart of GA), rather than NND, then the first two problems (i.e., Q1 and Q2 in Section 1) of GA will also exist here. And as the layer number increases, the overall problems will be more nontrivial in this hierarchical system. In contrast, given the estimated potentials on nodes, NND brings in no additional problem, no matter how many layers the hierarchy contains.

In our perspective, GD provides a general solution to the problem of locating the maximum points in the density function, i.e., the traditional optimization problem. In contrast, NND grasps the “particularity”. This “particularity” mainly refers to the fact that the value set of the independent variable for the assumed underlying density function in this particular problem here is finite, discrete and known (i.e., the input dataset). For this reason, a graph (or map) could be constructed after the first independent hopping of each point to its parent node. In fact, this “particularity” can also explain why previously there could also exist other alternatives of GA, i.e., Mean-Shift and Graph-GA.

computation time in this task. The reason is that although GD (or GA) lets each point choose the steepest path to reach the density peak, each step of approaching the density peak involves the equivalent computation. In comparison, for NND, only the first step involves a certain degree of computation, but the computation time by the remaining steps are almost negligible due to the pre-specified paths on the constructed graph. Even compared with Graph-GA, the graph-based approximation of GA, the “steepest pursuit” also makes no considerable advantage in computation time. Admittedly, the path to the root node for each node in the graph constructed by Graph-GA is generally nearer to the “steepest path” and thus contains less directed edges, as compared with NND. This is illustrated in Fig. 13. For the same point (in blue), the path in the left image (Graph-GA) obviously contains less edges than that in the right image (NND). And overall, the number of the total edges generated by Graph-GA is less than that by NND. Nevertheless, this brings Graph-GA negligible advantage in terms of the computation time in this graph-based searching, as compared with the similar case in NND, since, for NND, the computation time for such kind of graph-based search could already be negligible in general, revealed in Table S2 in (8), where9, for instance, the computation time on a dataset with N = 8124 data points costs no more than 0.005s. Note that although the graph in (8) is not constructed by NND, the way of searching the root nodes is of the same process as here.

Fig. 13. Comparison between Graph-GA and NND in terms of the path. The graph generated by Graph-GA contains less edges.

Beside, this “steepest pursuit” is also not more efficient in terms of the parameter in this task. The reason can be revealed by comparing NND and Graph-GA. Compared with Graph-GA, NND is less sensitive to the parameter due to the constraint of the “nearest” requirement. This has been partly shown in H-NND for its insensitivity to the neighborhood parameter in a large range, and this is also fully revealed by the stark comparison between NND (left column) and Graph-GA (right column) in Fig. 14 with different values of the neighborhood parameter k. Note that, we use the k-NN graph for both of them to define the neighborhood relationship. Besides, in the literature of Graph-GA (4), the authors first estimate the density based on the neighborhood relationship. Here in Fig. 14, in order to compare NND and Graph-GA, we assume that for both NND and Graph-GA, the underlying densities have already been well estimated as what Fig. 1B shows. We can see that, compared with the corresponding result in Fig. 13 ( k = 10 ), when k is increased to 20 in Fig. 14, there is no significant change for NND, whereas the problem (denoted by the red circle) starts to occur for Graph-GA. When k = 200 or 787, only four or seven undesired edges (in cyan) are falsely generated by NND, whereas the corresponding results for Graph-GA are severer. Besides, for NND, by removing the largest four or seven edges, good clustering results will be obtained, whereas, for Graph-GA, when all the redundant edges are successfully removed, this would lead to severe over-partitioning clustering results. Besides, note that when 1 (= 787 here), NND becomes ND, which is obviously not a too bad circumstance.

Fig. 14. Comparison between Graph-GA and NND in terms of parameter. NND presents much better performance while the value of k is increased.

6.5 The Fault-tolerant design.

Unlike DENCLUE, Mean-Shift, Graph-GA and NND, D-NND does not seek to directly get the clustering result once and for all. Instead, it follows such Fault-tolerant

how to have data effectively organized first (referring to constructing the IT structure here) with the tolerance of making “mistakes” (referring to the redundant edges in IT), and then seek to “repair them” (referring to removing the redundant edges here).

In our perspective, although hierarchical strategy makes H-NND generate more distinguishable redundant edges and D-NND more robust to the parameters compared with ND, the most contributing factor for D-HHN, together with ND (or Rodriguez and Laio’s DG) and H-NND, is this Fault-tolerant design. The reasons are analyzed as follows.

enriched by a variety of methods (e.g., different methods to construct IT: ND, H-NND, D-NND; and different methods to remove the redundant edges in IT: E-Cut, Deci-sion-Graph-Cut, IT-map, G-AP, etc.). Similarly, in our opinion, the Fault-tolerant design could also be used to explain the enrichment of the traditional link-based hierarchical clustering (L-HC) methods in which different methods (e.g., single linkage, complete linkage, average linkage) have been proposed to construct the Dendrogram. Despite the similarity in terms of the Fault-tolerant design, however, compared with L-HC, the advantages of the IT-based clustering are twofold: (i) for this Fault-tolerant design, one thing vital is that how the “mistakes” are designed or will be generated, since the more distinguishable the mistakes are, the easier and more reliable the following repair methods could be. For ND, H-NND and the proposed me-

thod D-NND, this is just their biggest advantage, since the redundant edges could in general be distinguishable, in sharp contrast to the traditional L-HC (especially the single-linked one, which is generally equivalent to MST). (ii) Besides the saliency of the redundant edges, the diversity of the repair methods is another advantage for the IT-based clustering methods compared with L-HC, since a set of effective repair methods (as mentioned in the above paragraph) can be used to help remove the error links in this particular graph structure, IT, also in sharp contrast to the only choice (i.e., the Dendrogram) for L-HC.

It seems that this Fault-tolerant design could be a quite efficient way to reach a certain degree of robustness while at the lowest cost and using a simpler system.

6.6 Think globally, while learning locally in hierarchy.

Like the idea (“think globally, fit locally”) in the famous dimensionality reduction method locally linear embedding (LLE), here we also use the local information to estimate the density (characterized by potential in this paper), despite it is a global feature. Nevertheless, in this special problem, we also combine such local estimation with the hierarchical strategy so that, as the layer increases, the magnitudes of the estimated potentials on the points in higher layer can gradually reveal more and more global density features. Also, since, in each layer, the potential is updated based on the local neighborhood relationship (which is relatively reliable) instead of the sin-gle-scaled kernel bandwidth (in fact the effect of the bandwidth has been largely constrained, see Section 6.3), the density estimation could be in effect adaptive to the multiple scales in the clusters.

In fact, in our opinion, the successes of such methods as Segmentation by Weighted Aggregation (SWA) (34) and Tree Preserving Embedding (TPE) (35) in solving the similar multi-resolution challenges in other unsupervised tasks (image segmentation and dimensionality reduction, respectively) are also largely benefited from the similar idea reflected in this D-NND, that is, Thinking globally, while learning locally in hierarchy.

References

1. Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31(8):651-666.

2. Xu R & Wunsch D (2005) Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3):645-678.

3. Theodoridis S & Koutroumbas K (2009) Pattern Recognition, Fourth Edition (Academic Press Elsevier).

4. Koontz WL, Narendra PM, & Fukunaga K (1976) A graph-theoretic approach to nonparametric cluster analysis. IEEE Trans. Comput. 100(9):936-944.

5. Ester M, Kriegel H-P, Sander J, & Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, pp 226-231.

6. Hinneburg A & Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. KDD, pp 58-65.

7. Rodriguez A & Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):1492-1496.

8. Qiu T, Yang K, Li C, & Li Y (2014) A Physically Inspired Clustering Algorithm: to Evolve Like Particles. arXiv preprint arXiv:1412.5902.

9. Qiu T & Li Y (2015) Clustering by Descending to the Nearest Neighbor in the Delaunay Graph Space. arXiv preprint arXiv:1502.04502.

10. Qiu T & Li Y (2015) H-NND: A New Member of the In-Tree (IT) Clustering Family. arXiv preprint arXiv:1509.02805.

11. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam, J. Neyman, Eds, (Univ. California Press, Berkeley, CA), pp 281-297.

12. McLachlan G & Peel D (2000) Finite Mixture Models: Wiley Series in Probability and Mathematical Statistics. (John Wiley & Sons, Inc).

13. McLachlan GJ & Peel D (1998) Robust cluster analysis via mixtures of multivariate t-distributions. Advances in pattern recognition, (Springer), pp 658-666.

14. Peel D & McLachlan GJ (2000) Robust mixture modelling using the t distribution. Statistics and Computing 10(4):339-348.

15. Dempster AP, Laird NM, & Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological):1-38.

16. Fukunaga K & Hostetler L (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 21(1):32-40.

17. Cheng Y (1995) Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on 17(8):790-799.

18. Comaniciu D & Meer P (2002) Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(5):603-619.

19. Elgammal A, Duraiswami R, & Davis LS (2003) Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on 25(11):1499-1504.

20. Carreira-Perpińán MÁ (2006) Acceleration strategies for Gaussian mean-shift image segmentation. Computer vision and pattern recognition, 2006 IEEE Computer Society Conference on, (IEEE), pp 1160-1167.

21. Georgescu B, Shimshoni I, & Meer P (2003) Mean shift based clustering in high dimensions: A texture classification example. Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, (IEEE), pp 456-463.

22. Yang C, Duraiswami R, DeMenthon D, & Davis L (2003) Mean-shift analysis using quasiNewton methods. Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, (IEEE), pp II-447-450 vol. 443.

23. Paris S & Durand F (2007) A topological approach to hierarchical segmentation using mean shift. Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, (IEEE), pp 1-8.

24. Shneiderman B (2014) The big picture for big data: visualization. Science 343(6172):730.

25. Gross JL & Yellen J (Handbook of Graph Theory (CRC press, Boca Raton, FL, 2004).

26. Zahn CT (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 100(1):68-86.

27. Qiu T & Li Y (2015) IT-map: an Effective Nonlinear Dimensionality Reduction Method for Interactive Clustering. arXiv preprint arXiv:1501.06450.

28. Qiu T & Li Y (2015) IT-Dendrogram: A New Member of the In-Tree (IT) Clustering Family. arXiv preprint arXiv:1507.08155.

29. Qiu T & Li Y (2015) A Generalized Affinity Propagation Clustering Algorithm for Nonspherical Cluster Discovery. arXiv preprint arXiv:1501.04318.

30. Fränti P & Virmajoki O (2006) Iterative shrinking method for clustering problems. Pattern Recognit. 39(5):761-775.

31. Gionis A, Mannila H, & Tsaparas P (2007) Clustering aggregation. ACM Trans. Knowl. Discovery Data 1(1):4.

32. Franti P, Virmajoki O, & Hautamaki V (2006) Fast agglomerative clustering using a k-nearest neighbor graph. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(11):1875-1881.

33. Frey BJ & Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972-976.

34. Sharon E, Galun M, Sharon D, Basri R, & Brandt A (2006) Hierarchy and adaptivity in segmenting visual scenes. Nature 442(7104):810-813.

35. Shieh AD, Hashimoto TB, & Airoldi EM (2011) Tree preserving embedding. Proc. Natl. Acad. Sci. U.S.A. 108(41):16916-16921.

Designed for Accessibility and to further Open Science