Explainable $k$-Means and $k$-Medians Clustering

2020·arXiv

Abstract

1 Introduction

A central direction in machine learning is understanding the reasoning behind decisions made by learned models [32, 40, 42]. Prior work on AI explainability focuses on the interpretation of a black-box model, known as post-modeling explainability [10, 47]. While methods such as LIME [46] or Shapley explanations [35] have made progress in this direction, they do not provide direct insight into the underlying data set, and the explanations depend heavily on the given model. This has raised concerns about the applicability of current solutions, leading researchers to consider more principled approaches to interpretable methods [48].

We address the challenge of developing machine learning systems that are explainable by design, starting from an unlabeled data set. Specifically, we consider pre-modeling explainability in the context of clustering. A common use of clustering is to identify patterns or discover structural properties in a data set by quantizing the unlabeled points. For instance, k-means clustering may be used to discover coherent groups among a supermarket’s customers. While there are many good clustering algorithms, the resulting cluster assignments can be hard to understand because the clusters may be determined using all the features of the data, and

(a) Optimal 5-means clusters

Figure 1: The optimal 5-means clustering (left) determines uses combinations of both features. The explainable clustering (middle) uses axis-aligned rectangles summarized by the threshold tree (right). Because the clusters contain nearby points, a small threshold tree makes very few mistakes and leads to a good approximation. The benefit of explainability would be more apparent in higher dimensions.

there may be no concise way to explain the inclusion of a particular point in a cluster. This limits the ability of users to discern the commonalities between points within a cluster or understand why points ended up in different clusters.

Our goal is to develop accurate, efficient clustering algorithms with concise explanations of the cluster assignments. There should be a simple procedure using a few features to explain why any point belongs to its cluster. Small decision trees have been identified as a canonical example of an easily explainable model [40, 42], and previous work on explainable clustering uses an unsupervised decision tree [13, 21, 23, 24, 33]. Each node of the binary tree iteratively partitions the data by thresholding on a single feature. We focus on finding k clusters, and hence, we use trees with k leaves. Each leaf corresponds to a cluster, and the tree is as small as possible. We refer to such a tree as a threshold tree.

There are many benefits of using a small threshold tree to produce a clustering. Any cluster assignment is explained by computing the thresholds along the root-to-leaf path. By restricting to k leaves, we ensure that each such path accesses at most 1 features, independent of the data dimension. In general, a threshold tree provides an initial quantization of the data set, which can be combined with other methods for future learning tasks. While we consider static data sets, new data points can be easily clustered by using the tree, leading to explainable assignments. To analyze clustering quality, we consider the k-means and k-medians objectives [36, 53]. The goal is to efficiently determine a set of k centers that minimize either the squared or the distance, respectively, of the input vectors to their closest center.

Figure 1 provides an example of standard and explainable k-means clustering on the same data set. Figure 1(a) on the left shows an optimal 5-means clustering. Figure 1(b) in the middle shows an explainable, tree-based 5-means clustering, determined by the tree in Figure 1(c) on the right. The tree has five leaf nodes, and vectors are assigned to clusters based on the thresholds. Geometrically, the tree defines a set of axis-aligned cuts that determine the clusters. While the two clusterings are very similar, using the threshold tree leads to easy explanations, whereas using a standard k-means clustering algorithm leads to more complicated clusters. The difference between the two approaches becomes more evident in higher dimensions, because standard algorithms will likely determine clusters based on all of the feature values.

To reap the benefits of explainable clusters, we must ensure that the data partition is a good approximation of the optimal clustering. While many efficient algorithms have been developed for k-means/medians clustering, the resulting clusters are often hard to interpret [6, 28, 43, 51]. For example, Lloyd’s algorithm alternates between determining the best center for the clusters and reassigning points to the closest center [34]. The resulting set of centers depends in a complex way to the other points in the data set. Therefore, the relationship between a point and its nearest center may be the result of an opaque combination of many feature values. This issue persists even after dimension reduction or feature selection, because a non-explainable clustering algorithm is often invoked on the modified data set. As our focus is on pre-modeling explanability, we aim for simple explanations that use the original feature vectors.

Even though Figure 1 depicts a situation in which the optimal clustering is very well approximated by one that is induced by a tree, it is not clear whether this would be possible in general. Our first technical challenge is to understand the price of explainability in the context of clustering: that is, the multiplicative blowup in k-means (or k-medians) cost that is inevitable if we force our final clustering to have a highly constrained, interpretable, form. The second challenge is to actually find such a tree efficiently. This is non-trivial because it requires a careful, rather than random, choice of a subset of features. As we will see, the kind of analysis that is ultimately needed is quite novel even given the vast existing literature on clustering.

1.1 Our contributions

We provide several new theoretical results on explainable k-means and k-medians clustering. Our new algorithms and lower bounds are summarized in Table 1.

Basic limitations. A partition into k clusters can be realized by a binary threshold tree with 1 internal splits. This uses at most 1 features, but is it possible to use even fewer, say log k features? In Section 3, we demonstrate a simple data set that requires Ω(k) features to achieve a explainable clustering with bounded approximation ratio compared to the optimal k-means/medians clustering. In particular, the depth of the tree might need to be 1 in the worst case.

One idea for building a tree is to begin with a good k-means (or k-medians) clustering, use it to label all the points, and then apply a supervised decision tree algorithm that attempts to capture this labeling. In Section 3, we show that standard decision tree algorithms, such as ID3, may produce clusterings with arbitrarily high cost. Thus, existing splitting criteria are not suitable for finding a low-cost clustering, and other algorithms are needed.

New algorithms. On the positive side, we provide efficient algorithms to find a small threshold tree that comes with provable guarantees on the cost. We note that using a small number of clusters is preferable for easy interpretations, and therefore k is often relatively small. For the special case of two clusters (k = 2), we show (Theorem 4.1) that a single threshold cut provides a constant-factor approximation to the optimal 2-medians/means clustering, with a closely-matching lower bound (Theorem 4.4), and we provide an efficient algorithm for finding the best cut. For general k, we show how to approximate any clustering by using a threshold tree with k leaves (Algorithm 1). The main idea is to minimize the number of mistakes made at each node in the tree, where a mistake occurs when a threshold separates a point from its original center. Overall, the cost of the explainable clustering will be close to the original cost up to a factor that depends on the tree depth (Theorem 5.1). In the worst-case, we achieve an approximation factor of ) for k-means and O(k) for k-medians compared to the cost of any clustering (e.g., the optimal cost). These results do not depend on the dimension or input size; hence, we get a constant factor approximation when k is constant.

Approximation lower bounds. Since our upper bounds depend on k, it is natural to wonder whether it is possible to achieve a constant-factor approximation, or whether the cost of explainability grows with k. On the negative side, we identify a data set such that any threshold tree with k leaves must incur an Ω(log k)-approximation for both k-medians and k-means (Theorem 5.9). For this data set, our algorithm achieves a nearly matching bound for k-medians.

Table 1: Summary of our upper and lower bounds on approximating the optimal k-medians/means clustering with explainable, tree-based clusters. The values express the factor increase compared to the optimal solution in the worst case.

1.2 Related work

The majority of work on explainable methods considers supervised learning, and in particular, explaining predictions of neural networks and other trained models [5, 20, 22, 29, 32, 35, 40, 42, 46, 47, 48, 52]. In contrast, there is much less work on explainable unsupervised learning. Standard algorithms for k-medians/means use iterative algorithms to produce a good approximate clustering, but this leads to complicated clusters that depend on subtle properties of the data set [1, 6, 28, 43]. Several papers consider the use of decision trees for explainable clustering [13, 21, 23, 24, 33]. However, all prior work on this topic is empirical, without any theoretical analysis of quality compared to the optimal clustering. We also remark that the previous results on tree-based clustering have not considered the k-medians/means objectives for evaluating the quality of the clustering, which is the focus of our work. It is NP-hard to find the optimal k-means clustering [4, 19] or even a very close approximation [8]. In other words, we expect tree-based clustering algorithms to incur an approximation factor bounded away from one compared to the optimal clustering.

One way to cluster based on few features is to use dimensionality reduction. Two main types of dimensionality reduction methods have been investigated for k-medians/means. Work on feature selection shows that it is possible to cluster based on Θ(k) features and obtain a constant factor approximation for k-means/medians [15, 17]. However, after selecting the features, these methods employ existing approximation algorithms to find a good clustering, and hence, the cluster assignments are not explainable. Work on feature extraction shows that it is possible to use the Johnson-Lindenstrauss transform to Θ(log k) dimensions, while preserving the clustering cost [11, 38]. Again, this relies on running a k-means/medians algorithm after projecting to the low dimensional subspace. The resulting clusters are not explainable, and moreover, the features are arbitrary linear combinations of the original features.

Besides explainability, many other clustering variants have received recent attention, such as fair clustering [2, 9, 12, 16, 26, 30, 37, 49], online clustering [14, 18, 25, 31, 41], and the use of same-cluster queries [3, 7, 27, 39]. An interesting avenue for future work would be to further develop tree-based clustering methods by additionally incorporating some of these other constraints or objectives.

2 Preliminaries

Throughout we use bold variables for vectors, and we use non-bold for scalars such as feature values. Given a set of points and an integer k the goal of k-medians and k-means clustering is to partition X into k subsets and minimize the distances of the points to the centers of the clusters. It is known that the optimal centers correspond to means or medians of the clusters, respectively. Denoting the centers as , the aim of k-means is to find a clustering that minimizes the following objective

where ) = arg min. Similarly, the goal of k-medians is to minimize

where ) = arg min. As it will be clear from context whether we are talking about k-medians or k-means, we abuse notation and write cost and c(x) for brevity. We also fix the data set and use opt to denote the optimal k-medians/means clustering, where the optimal centers are the medians or means of the clusters, respectively; hence, cost(opt) refers to the cost of the optimal k-medians/means clustering.

2.1 Clustering using threshold trees

Perhaps the simplest way to define two clusters is to use a threshold cut, which partitions the data based on a threshold for a single feature. More formally, the two clusters can be written as = ( ), which is defined using a coordinate i and a threshold in the following way. For each input point , we place ] in the first cluster , and otherwise . A threshold cut can be used to explain 2-means or 2-medians clustering because a single feature and threshold determines the division of the data set into exactly two clusters.

For k > 2 clusters, we consider iteratively using threshold cuts as the basis for the cluster explanations. More precisely, we construct a binary threshold tree. This tree is an unsupervised variant of a decision tree. Each internal node contains a single feature and threshold, which iteratively partitions the data, leading to clusters determined by the vectors that reach the leaves. We focus on trees with exactly k leaves, one for each cluster {1, 2, . . . , k}, which also limits the depth and total number of features to at most

When clustering using such a tree, it is easy to understand why x was assigned to its cluster: we may simply inspect the threshold conditions on the root-to-leaf path for x. This also ensures the number of conditions for the cluster assignment is rather small, which is crucial for interpretability. These tree-based explanations are especially useful in high-dimensional space, when the number of clusters is much smaller than the input dimension (). More formally, a threshold tree T with k leaves induces a k-clustering of the data. Denoting these clusters as , we define the k-medians/means cost of the tree as

Our goal is to understand when it is possible to efficiently produce a tree T such that cost(T) is not too large compared to the optimal k-medians/means cost. Specifically, we say that an algorithm is an a-approximation, if the cost is at most a times the optimal cost, i.e., if the algorithm returns threshold tree T then we have cost(denotes the optimal k-medians/means clustering.

(a) Optimal threshold tree for the data set in

(b) The ID3 split results in a 3-means/medians clustering with arbitrarily worse cost than the optimal because it places the top two points in separate clusters. Our algorithm (Section 5) instead starts with the optimal first split.

Figure 2: Motivating examples showing that (a) threshold trees may need depth 1 to determine k clusters, and (b) standard decision tree algorithms such as ID3 or CART perform very badly on some data sets.

3 Motivating Examples

We start with a simple but important bound showing that trees with depth less than k (or fewer than 1 features) can be arbitrarily worse than the optimal clustering. Consider the data set consisting of the 1 standard basis vectors along with the all zeros vector. As this data set has k points, the optimal k-median/means cost is zero, putting each point in its own cluster. Unfortunately, it is easy to see that for this data, depth 1 is necessary for clustering with a threshold tree. Figure 2(a) depicts an optimal tree for this data set. Shorter trees do not work because projecting onto any 2 coordinates does not separate the data, as at least two points will have all zeros in these coordinates. Therefore, any tree with depth at most 2 will put two points in the same cluster, leading to non-zero cost, whereas the optimal cost is zero. In other words, for this data set, caterpillar trees such as Figure 2(a) are necessary and sufficient for an optimal clustering. This example also shows that Θ(k) features are tight for feature selection [17] and provides a separation with feature extraction methods that use a linear map to only a logarithmic number of dimensions [11, 38].

Standard top-down decision trees do not work. A natural approach to building a threshold tree is to (1) find a good k-medians or k-means clustering using a standard algorithm, then (2) use it to label all the points, and finally (3) apply a supervised decision tree learning procedure, such as ID3 [44, 45] to find a threshold tree that agrees with these cluster labels as much as possible. ID3, like other common decision tree algorithms, operates in a greedy manner, where at each step it finds the best split in terms of entropy or information gain. We will show that this is not a suitable strategy for clustering and that the resulting tree can have cost that is arbitrarily bad. In what follows, denote by cost() the cost of the decision tree with leaves returned by ID3 algorithm.

Figure 2(b) depicts a data set partitioned into three clusters . We define two centers = (0) and = (2, 0) and for each , we define as 500 i.i.d. points ) for some small 0. Then, (2, v)} where . With high probability, we have that the optimal 3-means clustering is (), i.e. gets label such that . The ID3 algorithm minimizes the entropy at each step. In the first iteration, it splits between the two large clusters. As a result () will also be separated from one another. Since outputs a tree with exactly three leaves, one of the leaves must contain a point from together with points from either means that cost() = Ω(. Note that cost(()) does not depend on v, and hence, it is substantially smaller than cost(). Unlike ID3, the optimal threshold tree first separates and in the second split it separates . Putting the outliers in a separate cluster is necessary for an optimal clustering. It is easy to extend this example to more clusters or to when ID3 uses more leaves.

4 Two Clusters Using a Single Threshold Cut

In this section, we consider the case of k = 2 clusters, and we study how well a single threshold cut can approximate the optimal partition into two clusters.

4.1 Algorithm for k = 2

We present an algorithm to efficiently minimize the cost using a single threshold cut. We begin by considering a single feature i and determining the value of the best threshold for this feature. Then, we minimize over all features ] to output the best threshold cut. We focus on the 2-means algorithm; the 2-medians case is similar.

For feature i, we first sort the input points according to this feature, i.e., assume that the vectors are indexed as Notice that when restricting to this feature, there are only 1 possible partitions of the data set into two non-empty clusters. In particular, we can calculate the cost of all threshold cuts for the ith feature by scanning the values in this feature from smallest to largest. Then, we compute for each position

where we denote the optimal centers for these clusters as ) = and ) = because these are the means of the first points, respectively. Because there are O(nd) possible thresholds, and naively computing the cost of each requires time O(nd), this would lead to a running time of ). We can improve the time to + nd log n) by using dynamic programming. Pseudo-code for the algorithm and description of the dynamic programming are in Appendix D.

4.2 Theoretical guarantees for k = 2

We prove that there always exists a threshold cut with low cost. Since our algorithm from the previous section finds the best cut, it achieves the guarantees of this theorem.

Theorem 4.1. For any data set , there is a threshold cut such that the 2-medians cost satisfies

and there is a threshold cut such that the 2-means cost satisfies

where opt is the optimal 2-medians or means clustering.

The key idea of the analysis is to bound the cost of the threshold clustering in terms of the number of points on which it disagrees with an optimal clustering. Intuitively, if any threshold cut must lead to a fairly different clustering, then the cost of the optimal 2-medians/means clustering must also be large.

We note that it is possible to prove a slightly weaker bound by using the midpoint (for each feature) between the centers. When there are t changes, using the midpoint shows that cost(opt) is at least t times half of the distance between the two centers. In other words, this argument only captures half of the cost. Using Halls theorem, we show that each change corresponds to a pair in the matching, and each such pair contributes to cost(opt) the distance between the centers (not half as before). This improves the bound by a factor of two. The proof for 2-means is in Appendix C.

Notation. We denote the optimal clusters as with optimal centers . Notice that we can assume for each coordinate i because negating the ith coordinate for all points in the dataset does not change the 2-medians/means cost. Assume that a single threshold partitions X into such that

We refer to these t points as changes and assume that t is the minimum possible over all threshold cuts. If is an optimal 2-medians clustering, then we prove that the cost of is at most twice theoptimal 2-medians cost. Similarly, if is an optimal 2-means clustering, then we prove that the cost of is at most four times the optimal 2-means cost. We simply need that the threshold cut minimizes the number of changes t compared to the optimal clusters.

We begin with a structural claim regarding the best threshold cut. This will allow us to obtain a tighter bound on the optimal 2-medians/means cost, compared to the general k > 2 case, in terms of the necessary number of changes. We utilize Hall’s theorem on perfect matchings.

Proposition 4.2 (Hall’s Theorem). Let (P, Q) be a bipartite graph. If all subsets have at least neighbors in Q, then there is a matching of size |P|.

be the optimal clustering of , and assume that any threshold cut requires t changes. For each , there are t disjoint pairs of vectors (and

be the centers for the optimal clusters . Focus on index ], and assume without loss of generality that . The t pairs correspond to a matching the following bipartite graph (P, Q). Let and define as the t points in with largest value in their ith coordinate. Connect by an edge if only if By construction, a matching with t edges implies our claim. By Hall’s theorem, we just need to prove that has at least neighbors. Index by ascending value of ith coordinate, Now, notice that vertices inP have nested neighborhoods: for all , the neighborhood of is a subset of the neighborhood of . It suffices to prove that has at least j neighbors, because this implies that any subset has at least neighbors, guaranteeing a matching of size |P| = t. Indeed, if then we know that for some , implying that has at least neighbors.

Assume for contradiction that has at most 1 neighbors. We argue that the threshold cut has fewer than t changes, which contradicts the fact that all threshold cuts must make at least t changes. By our assumption, there are at most 1 points that are smaller than and belong to the second cluster. By the definition of P, there are exactly points with a larger ith coordinate than in the first cluster. Therefore, the threshold cut makes at most (changes, a contradiction.

4.3 Upper Bound Proof for 2-medians

Suppose are optimal 2-medians centers for clusters , and that the threshold cut changes, which is the minimum possible. A simple argument allows us to upper bound the cost of the threshold cut as cost(opt) plus an error term that depends on the number of changes. More formally, Lemma 5.5 (see Section 5.2.5) in the special case of k = 2 implies that

In other words, the core of the argument is to prove that

Applying Lemma 4.3 for each coordinate ] guarantees t pairs of vectors (following properties. Each corresponds to the ith coordinate of some point in corresponds to the ith coordinate of some point in . Furthermore, for each coordinate, the t pairs correspond to 2t distinct points in X. Finally, we can assume without loss of generality that , which implies that

4.4 Lower bounds for k = 2

We next show that optimal clustering is not, in general, realizable with a single threshold cut, except in a small number of dimensions (e.g., d = 1). Our lower bounds on the approximation ratio increase with the dimension, approaching two for 2-medians or three for 2-means.

The two lower bounds are based on a data set consisting of 2d points, split into two optimal clusters each with d points. The first cluster contains the d vectors of the form , where is the ith coordinate vector and 1 is the all-ones vector. The second cluster contains their negations, . Due to the zero-valued coordinate in each vector, any threshold cut must separate at least one vector from its optimal center. In the case of 2-medians, each incorrect cluster assignment incurs a cost of 2d. The optimal cost is roughly 2d, while the threshold cost is roughly 4d (correct assignments contribute the error), leading to an approximation ratio of nearly two. A similar result holds for 2-means. The proof of these two lower bounds is in Appendix B.

Theorem 4.4. For any integer 1, define data set as above. Any threshold cut must have 2-medians cost

and 2-means cost

where opt is the optimal 2-medians or means clustering.

5 Threshold trees with k > 2 leaves

We provide an efficient algorithm to produce a threshold tree with k leaves that constitutes an approximate k-medians or k-means clustering of a data set X. Our algorithm, Iterative Mistake Minimization (IMM), starts with a reference set of cluster centers, for instance from a polynomial-time constant-factor approximation algorithm for k-medians or k-means [1], or from a domain-specific clustering heuristic.

We then begin the process of finding an explainable approximation to this reference clustering, in the form of a threshold tree with k leaves, whose internal splits are based on single features. The way we do this is almost identical for k-medians and k-means, and the analysis is also nearly the same. Our algorithm is deterministic and its run time is only O(kdn log n), after finding the initial centers.

As discussed in Section 3, existing decision tree algorithms use greedy criteria that are not suitable for our tree-building process. However, we show that an alternative greedy criterion—minimizing the number of mistakes at each split (the number of points separated from their corresponding cluster center)—leads to a favorable approximation ratio to the optimal k-medians or k-means cost.

5.1 Our algorithm

We first discuss the running time, and we analyze the approximation guarantees of IMM in Section 5.2. Time analysis of tree building. We sketch how to execute the algorithm in time O(kdn log n) for an n-point data set. At each step of the top-down procedure, we find a coordinate and threshold pair that minimizes the mistakes at this node (line 10 in build tree procedure). We use dynamic programming to avoid recomputing the cost from scratch for each potential threshold. For each coordinate ], we sort the data and centers. Then, we iterate over possible thresholds. We claim that we can process each node in time O(dn log n) because each point will affect the number of mistakes at most twice. Indeed, when the threshold moves, either a data point or a center moves to the other side of the threshold. Since we know the number of mistakes from the previous threshold, we count the new mistakes efficiently as follows. If a single point switches sides, then the number of mistakes changes by at most one. If a center switches sides, which happens at most once, then we update the mistakes for this center. Overall, each point affects the mistakes at most twice (once when changing sides, and once when its center switches sides). Thus, the running time for each internal node is O(dn log n). As the tree has 1 internal nodes, the total time is O(kdn log n).

Figure 3: Figure 3(a) presents the optimal 5-means clustering. Figures 3(b)–3(e) depict the four splits of the IMM algorithm. The first split separates between cluster 1 and the rest, with a single mistake (marked as a red cross). Next, the IMM separates cluster 3 with 2 additional mistakes. The third split separates cluster 2, and this time the minimal number of mistakes is 12 for this split. Eventually, clusters 0 and 4 are separated without any mistakes.

5.2 Approximation guarantee for the IMM algorithm

Our main theoretical contribution is the following result.

Theorem 5.1. Suppose that IMM takes centers and returns a tree T of depth H. Then,

In particular, IMM achieves worst case approximation factors of by using any O(1) approximation algorithm (compared to the optimal k-medians/means) to generate the initial centers.

We state the theorem in terms of the depth of the tree to highlight that the approximation guarantee may depend on the structure of the input data. If the optimal clusters can be easily identified by a small number of salient features, then the tree may have depth O(log k). We later provide a lower bound showing that an Ω(log k) approximation factor is necessary for k-medians and k-means (Theorem 5.9). For this data set, our algorithm produces a threshold tree with depth O(log k), and therefore, the analysis is tight for k-medians. We leave it as an intriguing open question whether the bound can be improved for k-means.

5.2.1 Proof Overview for Theorem 5.1

The proof proceeds in three main steps. First, we rewrite the cost of IMM in terms of the minimum number of mistakes made between the output clustering and the clustering based on the given centers. Second, we provide a lemma that relates the cost of any clustering to the number of mistakes required by a threshold clustering. Finally, we put these two together to show that the output cost is at most an O(H) factor larger than the k-medians cost and at most an O(Hk) factor larger than the k-means cost, respectively, where H is the depth of the IMM tree, and the cost is relative to cost(

The approximation bound rests upon a characterization of the excess clustering cost induced by the tree. For any internal node u of the final tree T, let cell(denote the region of the input space that ends up in that node, and let B(u) be the bounding box of the centers that lie in this node, or more precisely, B(u) = : 1 ). We will be interested in the diameter of this bounding box, measured either by or squared norm, and denoted by diam)) and diam)), respectively.

Upper bounding the cost of the tree. The first technical claim (Lemma 5.5) will show that if IMM takes centers and returns a tree T that incurs mistakes at node

Briefly, any point x that ends up in a different leaf from its correct center incurs some extra cost. To bound this, consider the internal node u at which x is separated from also contains the center

that ultimately ends up in the same leaf as x. For k-medians, the excess cost for x can then be bounded by )). The argument for k-means is similar. These )) terms can in turn be bounded in terms of the cost of the reference clustering.

Lower bounding the reference cost. We next need to relate the cost of the centers to the number of mistakes and the diameter of the cells in the tree. Lemma 5.6 will show that if IMM makes mistakes at node

The proof for this is significantly more complicated than the upper bound mentioned above. Moreover, it contains the main new techniques in our analysis of tree-based clusterings.

The core challenge is that we aim to lower bound the cost of the given centers using only information about the number of mistakes at each internal node. Moreover, the IMM algorithm only minimizes the number of mistakes, and not the cost of each mistake. Therefore, we must show that if every axis-aligned cut in B(u) separates at least from their centers, then there must be a considerable distance between the points in cell(u) and their centers.

To prove this, we analyze the structure of points in each cell. Specifically, we consider the single-coordinate projection of points in the box B(u), and we order the centers in B(u) from smallest to largest for the analysis. If there are centers in node u, we consider the partition of 1) disjoint segments, splitting at the centers and at the midpoints between consecutive centers. Since is the minimum number of mistakes, we must in particular have at least mistakes from the threshold cut at each midpoint. We argue that each of these segments is covered at least times by a certain set of intervals. Specifically, we consider the intervals between mistake points and their true centers, and we say that an interval covers a segment if the segment is contained in the interval. This allows us to capture the cost of mistakes at different distance scales. For example, if a point is very far from its true center, then it covers many disjoint segments, and we show that it also implies a large contribution to the cost. Claim 5.8 in Section 5.2.5 provides our main covering result, and we use this to argue that the cost of the given centers can be lower bounded in terms of the distance between consecutive centers in B(u). For k-medians, we can directly derive a lower bound on the cost in terms of the -means, however, we employ Cauchy-Schwarz, which incurs an extra factor of k in the bound with diam)). Overall, we sum these bounds over the height H of the tree, leading to the claimed upper bounds in the above lemma.

5.2.2 Preliminaries and Notation for Theorem 5.1

Let be the reference centers, and let T be the resulting IMM tree. Each internal node u corresponds to a value and a coordinate ]. The tree partitions X into k clusters based on the points that reach the k leaves in T, where we index the clusters so that leaf j contains the centers and , where is the mean of for k-means and the median of for k-medians. This provides a bijection between old and new centers (and clusters). Recall that the map associates each point to its nearest center (i.e., c(x) corresponds to the cluster assignment given by the centers

For a node denote the surviving data set vectors at node based on the thresholds from the root to u. We also define ] be the set of surviving centers at node u from the set where these centers satisfy the thresholds from the root to u. Define and to be the maximal (smallest and largest) coordinate-wise values of the centers in , that is, for

In other words, using the previous notation and recalling that ), we have that

Recall that for node denotes the number of mistakes incurred during the threshold cut defined by u, where a point x is a mistake at node u if x reaches u, it was not a mistake before, and exactly one of the following two events occurs:

Let be a partition of the input data set into two parts, where x is in if it reaches the same leaf node in T as its center c(x), and otherwise, x is in . In other words, contains all points that are a mistake at any node u in T, and the rest of the points are in . We note that the notion of “mistakes” used here is different than the definition of “changes” used for the analysis of 2-means/medians, even though we reuse some of the same notation.

We need a standard consequence of the CauchySchwarz inequality to analyze the k-means cost.

it holds that

Proof. Denote by a the vector () and by b the vector (By the CauchySchwarz

inequality

We also need two facts, which state the optimal center for a cluster corresponds to mean or median of the points in the cluster, respectively. The proofs of these facts can be found in standard texts [50].

Fact 5.3. For any set , the optimal center under the cost is the mean

Fact 5.4. For any set , the optimal center cost is the coordinate-wise median, defined for

5.2.3 The Two Main Lemmas and the Proof of Theorem 5.1

To prove the theorem, we state two lemmas that aid in analyzing the cost of the given clustering versus the IMM clustering. The theorem will follow from these lemmas, and we will prove the lemmas in the proceeding subsections. We start with the lemma relating the number of mistakes at each node u and the distance between to the cost incurred by the given centers.

Lemma 5.5. If IMM takes centers and returns a tree T of depth H that incurs mistakes at node

We next bound the cost of the given centers in the terms of the number of mistakes in the tree. The key idea is that if there must be many mistakes at each node, then the cost of the given centers must actually be fairly large.

Lemma 5.6. If IMM takes centers and returns a tree T of depth H that incurs mistakes at node

Combining these two lemmas immediately implies Theorem 5.1.

Proof of Theorem 5.1. For k-medians, Lemmas 5.5 and 5.6 together imply that

For k-means, we have that

5.2.4 Proof of Lemma 5.5

We begin with the k-medians proof (the k-means proof will be similar). Notice that the cost can only increase when measuring the distance to the (suboptimal) center instead of the (optimal) center for cluster and hence,