A central direction in machine learning is understanding the reasoning behind decisions made by learned models [32, 40, 42]. Prior work on AI explainability focuses on the interpretation of a black-box model, known as post-modeling explainability [10, 47]. While methods such as LIME [46] or Shapley explanations [35] have made progress in this direction, they do not provide direct insight into the underlying data set, and the explanations depend heavily on the given model. This has raised concerns about the applicability of current solutions, leading researchers to consider more principled approaches to interpretable methods [48].
We address the challenge of developing machine learning systems that are explainable by design, starting from an unlabeled data set. Specifically, we consider pre-modeling explainability in the context of clustering. A common use of clustering is to identify patterns or discover structural properties in a data set by quantizing the unlabeled points. For instance, k-means clustering may be used to discover coherent groups among a supermarket’s customers. While there are many good clustering algorithms, the resulting cluster assignments can be hard to understand because the clusters may be determined using all the features of the data, and
(a) Optimal 5-means clusters
Figure 1: The optimal 5-means clustering (left) determines uses combinations of both features. The explainable clustering (middle) uses axis-aligned rectangles summarized by the threshold tree (right). Because the clusters contain nearby points, a small threshold tree makes very few mistakes and leads to a good approximation. The benefit of explainability would be more apparent in higher dimensions.
there may be no concise way to explain the inclusion of a particular point in a cluster. This limits the ability of users to discern the commonalities between points within a cluster or understand why points ended up in different clusters.
Our goal is to develop accurate, efficient clustering algorithms with concise explanations of the cluster assignments. There should be a simple procedure using a few features to explain why any point belongs to its cluster. Small decision trees have been identified as a canonical example of an easily explainable model [40, 42], and previous work on explainable clustering uses an unsupervised decision tree [13, 21, 23, 24, 33]. Each node of the binary tree iteratively partitions the data by thresholding on a single feature. We focus on finding k clusters, and hence, we use trees with k leaves. Each leaf corresponds to a cluster, and the tree is as small as possible. We refer to such a tree as a threshold tree.
There are many benefits of using a small threshold tree to produce a clustering. Any cluster assignment is explained by computing the thresholds along the root-to-leaf path. By restricting to k leaves, we ensure that each such path accesses at most 1 features, independent of the data dimension. In general, a threshold tree provides an initial quantization of the data set, which can be combined with other methods for future learning tasks. While we consider static data sets, new data points can be easily clustered by using the tree, leading to explainable assignments. To analyze clustering quality, we consider the k-means and k-medians objectives [36, 53]. The goal is to efficiently determine a set of k centers that minimize either the squared
or the
distance, respectively, of the input vectors to their closest center.
Figure 1 provides an example of standard and explainable k-means clustering on the same data set. Figure 1(a) on the left shows an optimal 5-means clustering. Figure 1(b) in the middle shows an explainable, tree-based 5-means clustering, determined by the tree in Figure 1(c) on the right. The tree has five leaf nodes, and vectors are assigned to clusters based on the thresholds. Geometrically, the tree defines a set of axis-aligned cuts that determine the clusters. While the two clusterings are very similar, using the threshold tree leads to easy explanations, whereas using a standard k-means clustering algorithm leads to more complicated clusters. The difference between the two approaches becomes more evident in higher dimensions, because standard algorithms will likely determine clusters based on all of the feature values.
To reap the benefits of explainable clusters, we must ensure that the data partition is a good approximation of the optimal clustering. While many efficient algorithms have been developed for k-means/medians clustering, the resulting clusters are often hard to interpret [6, 28, 43, 51]. For example, Lloyd’s algorithm alternates between determining the best center for the clusters and reassigning points to the closest center [34]. The resulting set of centers depends in a complex way to the other points in the data set. Therefore, the relationship between a point and its nearest center may be the result of an opaque combination of many feature values. This issue persists even after dimension reduction or feature selection, because a non-explainable clustering algorithm is often invoked on the modified data set. As our focus is on pre-modeling explanability, we aim for simple explanations that use the original feature vectors.
Even though Figure 1 depicts a situation in which the optimal clustering is very well approximated by one that is induced by a tree, it is not clear whether this would be possible in general. Our first technical challenge is to understand the price of explainability in the context of clustering: that is, the multiplicative blowup in k-means (or k-medians) cost that is inevitable if we force our final clustering to have a highly constrained, interpretable, form. The second challenge is to actually find such a tree efficiently. This is non-trivial because it requires a careful, rather than random, choice of a subset of features. As we will see, the kind of analysis that is ultimately needed is quite novel even given the vast existing literature on clustering.
1.1 Our contributions
We provide several new theoretical results on explainable k-means and k-medians clustering. Our new algorithms and lower bounds are summarized in Table 1.
Basic limitations. A partition into k clusters can be realized by a binary threshold tree with 1 internal splits. This uses at most
1 features, but is it possible to use even fewer, say log k features? In Section 3, we demonstrate a simple data set that requires Ω(k) features to achieve a explainable clustering with bounded approximation ratio compared to the optimal k-means/medians clustering. In particular, the depth of the tree might need to be
1 in the worst case.
One idea for building a tree is to begin with a good k-means (or k-medians) clustering, use it to label all the points, and then apply a supervised decision tree algorithm that attempts to capture this labeling. In Section 3, we show that standard decision tree algorithms, such as ID3, may produce clusterings with arbitrarily high cost. Thus, existing splitting criteria are not suitable for finding a low-cost clustering, and other algorithms are needed.
New algorithms. On the positive side, we provide efficient algorithms to find a small threshold tree that comes with provable guarantees on the cost. We note that using a small number of clusters is preferable for easy interpretations, and therefore k is often relatively small. For the special case of two clusters (k = 2), we show (Theorem 4.1) that a single threshold cut provides a constant-factor approximation to the optimal 2-medians/means clustering, with a closely-matching lower bound (Theorem 4.4), and we provide an efficient algorithm for finding the best cut. For general k, we show how to approximate any clustering by using a threshold tree with k leaves (Algorithm 1). The main idea is to minimize the number of mistakes made at each node in the tree, where a mistake occurs when a threshold separates a point from its original center. Overall, the cost of the explainable clustering will be close to the original cost up to a factor that depends on the tree depth (Theorem 5.1). In the worst-case, we achieve an approximation factor of ) for k-means and O(k) for k-medians compared to the cost of any clustering (e.g., the optimal cost). These results do not depend on the dimension or input size; hence, we get a constant factor approximation when k is constant.
Approximation lower bounds. Since our upper bounds depend on k, it is natural to wonder whether it is possible to achieve a constant-factor approximation, or whether the cost of explainability grows with k. On the negative side, we identify a data set such that any threshold tree with k leaves must incur an Ω(log k)-approximation for both k-medians and k-means (Theorem 5.9). For this data set, our algorithm achieves a nearly matching bound for k-medians.
Table 1: Summary of our upper and lower bounds on approximating the optimal k-medians/means clustering with explainable, tree-based clusters. The values express the factor increase compared to the optimal solution in the worst case.
1.2 Related work
The majority of work on explainable methods considers supervised learning, and in particular, explaining predictions of neural networks and other trained models [5, 20, 22, 29, 32, 35, 40, 42, 46, 47, 48, 52]. In contrast, there is much less work on explainable unsupervised learning. Standard algorithms for k-medians/means use iterative algorithms to produce a good approximate clustering, but this leads to complicated clusters that depend on subtle properties of the data set [1, 6, 28, 43]. Several papers consider the use of decision trees for explainable clustering [13, 21, 23, 24, 33]. However, all prior work on this topic is empirical, without any theoretical analysis of quality compared to the optimal clustering. We also remark that the previous results on tree-based clustering have not considered the k-medians/means objectives for evaluating the quality of the clustering, which is the focus of our work. It is NP-hard to find the optimal k-means clustering [4, 19] or even a very close approximation [8]. In other words, we expect tree-based clustering algorithms to incur an approximation factor bounded away from one compared to the optimal clustering.
One way to cluster based on few features is to use dimensionality reduction. Two main types of dimensionality reduction methods have been investigated for k-medians/means. Work on feature selection shows that it is possible to cluster based on Θ(k) features and obtain a constant factor approximation for k-means/medians [15, 17]. However, after selecting the features, these methods employ existing approximation algorithms to find a good clustering, and hence, the cluster assignments are not explainable. Work on feature extraction shows that it is possible to use the Johnson-Lindenstrauss transform to Θ(log k) dimensions, while preserving the clustering cost [11, 38]. Again, this relies on running a k-means/medians algorithm after projecting to the low dimensional subspace. The resulting clusters are not explainable, and moreover, the features are arbitrary linear combinations of the original features.
Besides explainability, many other clustering variants have received recent attention, such as fair clustering [2, 9, 12, 16, 26, 30, 37, 49], online clustering [14, 18, 25, 31, 41], and the use of same-cluster queries [3, 7, 27, 39]. An interesting avenue for future work would be to further develop tree-based clustering methods by additionally incorporating some of these other constraints or objectives.
Throughout we use bold variables for vectors, and we use non-bold for scalars such as feature values. Given a set of points and an integer k the goal of k-medians and k-means clustering is to partition X into k subsets and minimize the distances of the points to the centers of the clusters. It is known that the optimal centers correspond to means or medians of the clusters, respectively. Denoting the centers as
, the aim of k-means is to find a clustering that minimizes the following objective
where ) = arg min
. Similarly, the goal of k-medians is to minimize
where ) = arg min
. As it will be clear from context whether we are talking about k-medians or k-means, we abuse notation and write cost and c(x) for brevity. We also fix the data set and use opt to denote the optimal k-medians/means clustering, where the optimal centers are the medians or means of the clusters, respectively; hence, cost(opt) refers to the cost of the optimal k-medians/means clustering.
2.1 Clustering using threshold trees
Perhaps the simplest way to define two clusters is to use a threshold cut, which partitions the data based on a threshold for a single feature. More formally, the two clusters can be written as = (
), which is defined using a coordinate i and a threshold
in the following way. For each input point
, we place
] in the first cluster
, and otherwise
. A threshold cut can be used to explain 2-means or 2-medians clustering because a single feature and threshold determines the division of the data set into exactly two clusters.
For k > 2 clusters, we consider iteratively using threshold cuts as the basis for the cluster explanations. More precisely, we construct a binary threshold tree. This tree is an unsupervised variant of a decision tree. Each internal node contains a single feature and threshold, which iteratively partitions the data, leading to clusters determined by the vectors that reach the leaves. We focus on trees with exactly k leaves, one for each cluster {1, 2, . . . , k}, which also limits the depth and total number of features to at most
When clustering using such a tree, it is easy to understand why x was assigned to its cluster: we may simply inspect the threshold conditions on the root-to-leaf path for x. This also ensures the number of conditions for the cluster assignment is rather small, which is crucial for interpretability. These tree-based explanations are especially useful in high-dimensional space, when the number of clusters is much smaller than the input dimension (). More formally, a threshold tree T with k leaves induces a k-clustering of the data. Denoting these clusters as
, we define the k-medians/means cost of the tree as
Our goal is to understand when it is possible to efficiently produce a tree T such that cost(T) is not too large compared to the optimal k-medians/means cost. Specifically, we say that an algorithm is an a-approximation, if the cost is at most a times the optimal cost, i.e., if the algorithm returns threshold tree T then we have cost(denotes the optimal k-medians/means clustering.
(a) Optimal threshold tree for the data set in
(b) The ID3 split results in a 3-means/medians clustering with arbitrarily worse cost than the optimal because it places the top two points in separate clusters. Our algorithm (Section 5) instead starts with the optimal first split.
Figure 2: Motivating examples showing that (a) threshold trees may need depth 1 to determine k clusters, and (b) standard decision tree algorithms such as ID3 or CART perform very badly on some data sets.
We start with a simple but important bound showing that trees with depth less than k (or fewer than
1 features) can be arbitrarily worse than the optimal clustering. Consider the data set consisting of the
1 standard basis vectors
along with the all zeros vector. As this data set has k points, the optimal k-median/means cost is zero, putting each point in its own cluster. Unfortunately, it is easy to see that for this data, depth
1 is necessary for clustering with a threshold tree. Figure 2(a) depicts an optimal tree for this data set. Shorter trees do not work because projecting onto any
2 coordinates does not separate the data, as at least two points will have all zeros in these coordinates. Therefore, any tree with depth at most
2 will put two points in the same cluster, leading to non-zero cost, whereas the optimal cost is zero. In other words, for this data set, caterpillar trees such as Figure 2(a) are necessary and sufficient for an optimal clustering. This example also shows that Θ(k) features are tight for feature selection [17] and provides a separation with feature extraction methods that use a linear map to only a logarithmic number of dimensions [11, 38].
Standard top-down decision trees do not work. A natural approach to building a threshold tree is to (1) find a good k-medians or k-means clustering using a standard algorithm, then (2) use it to label all the points, and finally (3) apply a supervised decision tree learning procedure, such as ID3 [44, 45] to find a threshold tree that agrees with these cluster labels as much as possible. ID3, like other common decision tree algorithms, operates in a greedy manner, where at each step it finds the best split in terms of entropy or information gain. We will show that this is not a suitable strategy for clustering and that the resulting tree can have cost that is arbitrarily bad. In what follows, denote by cost() the cost of the decision tree with
leaves returned by ID3 algorithm.
Figure 2(b) depicts a data set partitioned into three clusters
. We define two centers
= (
0) and
= (2, 0) and for each
, we define
as 500 i.i.d. points
) for some small
0. Then,
(2, v)} where
. With high probability, we have that the optimal 3-means clustering is (
), i.e.
gets label
such that
. The ID3 algorithm minimizes the entropy at each step. In the first iteration, it splits between the two large clusters. As a result (
) will also be separated from one another. Since
outputs a tree with exactly three leaves, one of the leaves must contain a point from
together with points from either
means that cost(
) = Ω(
. Note that cost((
)) does not depend on v, and hence, it is substantially smaller than cost(
). Unlike ID3, the optimal threshold tree first separates
and in the second split it separates
. Putting the outliers in a separate cluster is necessary for an optimal clustering. It is easy to extend this example to more clusters or to when ID3 uses more leaves.
In this section, we consider the case of k = 2 clusters, and we study how well a single threshold cut can approximate the optimal partition into two clusters.
4.1 Algorithm for k = 2
We present an algorithm to efficiently minimize the cost using a single threshold cut. We begin by considering a single feature i and determining the value of the best threshold for this feature. Then, we minimize over all features
] to output the best threshold cut. We focus on the 2-means algorithm; the 2-medians case is similar.
For feature i, we first sort the input points according to this feature, i.e., assume that the vectors are indexed as Notice that when restricting to this feature, there are only
1 possible partitions of the data set into two non-empty clusters. In particular, we can calculate the cost of all threshold cuts for the ith feature by scanning the values in this feature from smallest to largest. Then, we compute for each position
where we denote the optimal centers for these clusters as ) =
and
) =
because these are the means of the first
points, respectively. Because there are O(nd) possible thresholds, and naively computing the cost of each requires time O(nd), this would lead to a running time of
). We can improve the time to
+ nd log n) by using dynamic programming. Pseudo-code for the algorithm and description of the dynamic programming are in Appendix D.
4.2 Theoretical guarantees for k = 2
We prove that there always exists a threshold cut with low cost. Since our algorithm from the previous section finds the best cut, it achieves the guarantees of this theorem.
Theorem 4.1. For any data set , there is a threshold cut
such that the 2-medians cost satisfies
and there is a threshold cut such that the 2-means cost satisfies
where opt is the optimal 2-medians or means clustering.
The key idea of the analysis is to bound the cost of the threshold clustering in terms of the number of points on which it disagrees with an optimal clustering. Intuitively, if any threshold cut must lead to a fairly different clustering, then the cost of the optimal 2-medians/means clustering must also be large.
We note that it is possible to prove a slightly weaker bound by using the midpoint (for each feature) between the centers. When there are t changes, using the midpoint shows that cost(opt) is at least t times half of the distance between the two centers. In other words, this argument only captures half of the cost. Using Halls theorem, we show that each change corresponds to a pair in the matching, and each such pair contributes to cost(opt) the distance between the centers (not half as before). This improves the bound by a factor of two. The proof for 2-means is in Appendix C.
Notation. We denote the optimal clusters as with optimal centers
. Notice that we can assume
for each coordinate i because negating the ith coordinate for all points in the dataset does not change the 2-medians/means cost. Assume that a single threshold partitions X into
such that
We refer to these t points as changes and assume that t is the minimum possible over all threshold cuts. If is an optimal 2-medians clustering, then we prove that the cost of
is at most twice theoptimal 2-medians cost. Similarly, if
is an optimal 2-means clustering, then we prove that the cost of
is at most four times the optimal 2-means cost. We simply need that the threshold cut
minimizes the number of changes t compared to the optimal clusters.
We begin with a structural claim regarding the best threshold cut. This will allow us to obtain a tighter bound on the optimal 2-medians/means cost, compared to the general k > 2 case, in terms of the necessary number of changes. We utilize Hall’s theorem on perfect matchings.
Proposition 4.2 (Hall’s Theorem). Let (P, Q) be a bipartite graph. If all subsets have at least
neighbors in Q, then there is a matching of size |P|.
be the optimal clustering of
, and assume that any threshold cut requires t changes. For each
, there are t disjoint pairs of vectors (
and
be the centers for the optimal clusters
. Focus on index
], and assume without loss of generality that
. The t pairs correspond to a matching the following bipartite graph (P, Q). Let
and define
as the t points in
with largest value in their ith coordinate. Connect
by an edge if only if
By construction, a matching with t edges implies our claim. By Hall’s theorem, we just need to prove that
has at least
neighbors. Index
by ascending value of ith coordinate,
Now, notice that vertices inP have nested neighborhoods: for all
, the neighborhood of
is a subset of the neighborhood of
. It suffices to prove that
has at least j neighbors, because this implies that any subset
has at least
neighbors, guaranteeing a matching of size |P| = t. Indeed, if
then we know that
for some
, implying that
has at least
neighbors.
Assume for contradiction that has at most
1 neighbors. We argue that the threshold cut
has fewer than t changes, which contradicts the fact that all threshold cuts must make at least t changes. By our assumption, there are at most
1 points that are smaller than
and belong to the second cluster. By the definition of P, there are exactly
points with a larger ith coordinate than
in the first cluster. Therefore, the threshold cut
makes at most (
changes, a contradiction.
4.3 Upper Bound Proof for 2-medians
Suppose are optimal 2-medians centers for clusters
, and that the threshold cut
changes, which is the minimum possible. A simple argument allows us to upper bound the cost of the threshold cut as cost(opt) plus an error term that depends on the number of changes. More formally, Lemma 5.5 (see Section 5.2.5) in the special case of k = 2 implies that
In other words, the core of the argument is to prove that
Applying Lemma 4.3 for each coordinate ] guarantees t pairs of vectors (
following properties. Each
corresponds to the ith coordinate of some point in
corresponds to the ith coordinate of some point in
. Furthermore, for each coordinate, the t pairs correspond to 2t distinct points in X. Finally, we can assume without loss of generality that
, which implies that
4.4 Lower bounds for k = 2
We next show that optimal clustering is not, in general, realizable with a single threshold cut, except in a small number of dimensions (e.g., d = 1). Our lower bounds on the approximation ratio increase with the dimension, approaching two for 2-medians or three for 2-means.
The two lower bounds are based on a data set consisting of 2d points, split into two optimal clusters each with d points. The first cluster contains the d vectors of the form
, where
is the ith coordinate vector and 1 is the all-ones vector. The second cluster contains their negations,
. Due to the zero-valued coordinate in each vector, any threshold cut must separate at least one vector from its optimal center. In the case of 2-medians, each incorrect cluster assignment incurs a cost of 2d. The optimal cost is roughly 2d, while the threshold cost is roughly 4d (correct assignments contribute
the error), leading to an approximation ratio of nearly two. A similar result holds for 2-means. The proof of these two lower bounds is in Appendix B.
Theorem 4.4. For any integer 1, define data set
as above. Any threshold cut
must have 2-medians cost
and 2-means cost
where opt is the optimal 2-medians or means clustering.
We provide an efficient algorithm to produce a threshold tree with k leaves that constitutes an approximate k-medians or k-means clustering of a data set X. Our algorithm, Iterative Mistake Minimization (IMM), starts with a reference set of cluster centers, for instance from a polynomial-time constant-factor approximation algorithm for k-medians or k-means [1], or from a domain-specific clustering heuristic.
We then begin the process of finding an explainable approximation to this reference clustering, in the form of a threshold tree with k leaves, whose internal splits are based on single features. The way we do this is almost identical for k-medians and k-means, and the analysis is also nearly the same. Our algorithm is deterministic and its run time is only O(kdn log n), after finding the initial centers.
As discussed in Section 3, existing decision tree algorithms use greedy criteria that are not suitable for our tree-building process. However, we show that an alternative greedy criterion—minimizing the number of mistakes at each split (the number of points separated from their corresponding cluster center)—leads to a favorable approximation ratio to the optimal k-medians or k-means cost.
5.1 Our algorithm
We first discuss the running time, and we analyze the approximation guarantees of IMM in Section 5.2. Time analysis of tree building. We sketch how to execute the algorithm in time O(kdn log n) for an n-point data set. At each step of the top-down procedure, we find a coordinate and threshold pair that minimizes the mistakes at this node (line 10 in build tree procedure). We use dynamic programming to avoid recomputing the cost from scratch for each potential threshold. For each coordinate ], we sort the data and centers. Then, we iterate over possible thresholds. We claim that we can process each node in time O(dn log n) because each point will affect the number of mistakes at most twice. Indeed, when the threshold moves, either a data point or a center moves to the other side of the threshold. Since we know the number of mistakes from the previous threshold, we count the new mistakes efficiently as follows. If a single point switches sides, then the number of mistakes changes by at most one. If a center switches sides, which happens at most once, then we update the mistakes for this center. Overall, each point affects the mistakes at most twice (once when changing sides, and once when its center switches sides). Thus, the running time for each internal node is O(dn log n). As the tree has
1 internal nodes, the total time is O(kdn log n).
Figure 3: Figure 3(a) presents the optimal 5-means clustering. Figures 3(b)–3(e) depict the four splits of the IMM algorithm. The first split separates between cluster 1 and the rest, with a single mistake (marked as a red cross). Next, the IMM separates cluster 3 with 2 additional mistakes. The third split separates cluster 2, and this time the minimal number of mistakes is 12 for this split. Eventually, clusters 0 and 4 are separated without any mistakes.
5.2 Approximation guarantee for the IMM algorithm
Our main theoretical contribution is the following result.
Theorem 5.1. Suppose that IMM takes centers and returns a tree T of depth H. Then,
In particular, IMM achieves worst case approximation factors of by using any O(1) approximation algorithm (compared to the optimal k-medians/means) to generate the initial centers.
We state the theorem in terms of the depth of the tree to highlight that the approximation guarantee may depend on the structure of the input data. If the optimal clusters can be easily identified by a small number of salient features, then the tree may have depth O(log k). We later provide a lower bound showing that an Ω(log k) approximation factor is necessary for k-medians and k-means (Theorem 5.9). For this data set, our algorithm produces a threshold tree with depth O(log k), and therefore, the analysis is tight for k-medians. We leave it as an intriguing open question whether the bound can be improved for k-means.
5.2.1 Proof Overview for Theorem 5.1
The proof proceeds in three main steps. First, we rewrite the cost of IMM in terms of the minimum number of mistakes made between the output clustering and the clustering based on the given centers. Second, we provide a lemma that relates the cost of any clustering to the number of mistakes required by a threshold clustering. Finally, we put these two together to show that the output cost is at most an O(H) factor larger than the k-medians cost and at most an O(Hk) factor larger than the k-means cost, respectively, where H is the depth of the IMM tree, and the cost is relative to cost(
The approximation bound rests upon a characterization of the excess clustering cost induced by the tree. For any internal node u of the final tree T, let cell(denote the region of the input space that ends up in that node, and let B(u) be the bounding box of the centers that lie in this node, or more precisely, B(u) =
: 1
). We will be interested in the diameter of this bounding box, measured either by
or squared
norm, and denoted by diam
)) and diam
)), respectively.
Upper bounding the cost of the tree. The first technical claim (Lemma 5.5) will show that if IMM takes centers and returns a tree T that incurs
mistakes at node
Briefly, any point x that ends up in a different leaf from its correct center incurs some extra cost. To bound this, consider the internal node u at which x is separated from
also contains the center
that ultimately ends up in the same leaf as x. For k-medians, the excess cost for x can then be bounded by )). The argument for k-means is similar. These
)) terms can in turn be bounded in terms of the cost of the reference clustering.
Lower bounding the reference cost. We next need to relate the cost of the centers to the number of mistakes and the diameter of the cells in the tree. Lemma 5.6 will show that if IMM makes
mistakes at node
The proof for this is significantly more complicated than the upper bound mentioned above. Moreover, it contains the main new techniques in our analysis of tree-based clusterings.
The core challenge is that we aim to lower bound the cost of the given centers using only information about the number of mistakes at each internal node. Moreover, the IMM algorithm only minimizes the number of mistakes, and not the cost of each mistake. Therefore, we must show that if every axis-aligned cut in B(u) separates at least from their centers, then there must be a considerable distance between the points in cell(u) and their centers.
To prove this, we analyze the structure of points in each cell. Specifically, we consider the single-coordinate projection of points in the box B(u), and we order the centers in B(u) from smallest to largest for the analysis. If there are centers in node u, we consider the partition of
1) disjoint segments, splitting at the centers and at the midpoints between consecutive centers. Since
is the minimum number of mistakes, we must in particular have at least
mistakes from the threshold cut at each midpoint. We argue that each of these segments is covered at least
times by a certain set of intervals. Specifically, we consider the intervals between mistake points and their true centers, and we say that an interval covers a segment if the segment is contained in the interval. This allows us to capture the cost of mistakes at different distance scales. For example, if a point is very far from its true center, then it covers many disjoint segments, and we show that it also implies a large contribution to the cost. Claim 5.8 in Section 5.2.5 provides our main covering result, and we use this to argue that the cost of the given centers can be lower bounded in terms of the distance between consecutive centers in B(u). For k-medians, we can directly derive a lower bound on the cost in terms of the
-means, however, we employ Cauchy-Schwarz, which incurs an extra factor of k in the bound with diam
)). Overall, we sum these bounds over the height H of the tree, leading to the claimed upper bounds in the above lemma.
5.2.2 Preliminaries and Notation for Theorem 5.1
Let be the reference centers, and let T be the resulting IMM tree. Each internal node u corresponds to a value
and a coordinate
]. The tree partitions X into k clusters
based on the points that reach the k leaves in T, where we index the clusters so that leaf j contains the centers
and
, where
is the mean of
for k-means and the median of
for k-medians. This provides a bijection between old and new centers (and clusters). Recall that the map
associates each point to its nearest center (i.e., c(x) corresponds to the cluster assignment given by the centers
For a node denote the surviving data set vectors at node
based on the thresholds from the root to u. We also define
] be the set of surviving centers at node u from the set
where these centers satisfy the thresholds from the root to u. Define
and
to be the maximal (smallest and largest) coordinate-wise values of the centers in
, that is, for
In other words, using the previous notation and recalling that ), we have that
Recall that for node
denotes the number of mistakes incurred during the threshold cut defined by u, where a point x is a mistake at node u if x reaches u, it was not a mistake before, and exactly one of the following two events occurs:
Let be a partition of the input data set into two parts, where x is in
if it reaches the same leaf node in T as its center c(x), and otherwise, x is in
. In other words,
contains all points
that are a mistake at any node u in T, and the rest of the points are in
. We note that the notion of “mistakes” used here is different than the definition of “changes” used for the analysis of 2-means/medians, even though we reuse some of the same notation.
We need a standard consequence of the CauchySchwarz inequality to analyze the k-means cost.
it holds that
Proof. Denote by a the vector () and by b the vector (
By the CauchySchwarz
inequality
We also need two facts, which state the optimal center for a cluster corresponds to mean or median of the points in the cluster, respectively. The proofs of these facts can be found in standard texts [50].
Fact 5.3. For any set , the optimal center under the
cost is the mean
Fact 5.4. For any set , the optimal center
cost is the coordinate-wise median, defined for
5.2.3 The Two Main Lemmas and the Proof of Theorem 5.1
To prove the theorem, we state two lemmas that aid in analyzing the cost of the given clustering versus the IMM clustering. The theorem will follow from these lemmas, and we will prove the lemmas in the proceeding subsections. We start with the lemma relating the number of mistakes at each node u and the distance between
to the cost incurred by the given centers.
Lemma 5.5. If IMM takes centers and returns a tree T of depth H that incurs
mistakes at node
We next bound the cost of the given centers in the terms of the number of mistakes in the tree. The key idea is that if there must be many mistakes at each node, then the cost of the given centers must actually be fairly large.
Lemma 5.6. If IMM takes centers and returns a tree T of depth H that incurs
mistakes at node
Combining these two lemmas immediately implies Theorem 5.1.
Proof of Theorem 5.1. For k-medians, Lemmas 5.5 and 5.6 together imply that
For k-means, we have that
5.2.4 Proof of Lemma 5.5
We begin with the k-medians proof (the k-means proof will be similar). Notice that the cost can only increase when measuring the distance to the (suboptimal) center instead of the (optimal) center
for cluster
and hence,
We can rewrite this sum using the partition , using the fact that whenever
the distance is computed with respect to the true center c(x),
Starting with the above cost bound, and using the triangle inequality, we see
To control the second term in the final line, we must bound the cost of the mistakes. We decompose based on the node u where
is first separated from its true center c(x) due to the threshold at node u. To this end, consider some point
, where its distance is measured to the incorrect center
). Both centers c(x) and
have survived until node u in the threshold tree T, and hence, both vectors are part of the definitions of
. In particular, we can use the upper bound
There are caused by the threshold at node u, and we have that
Therefore, we have, as desired
Analyzing k-means is similar; we incur a factor of two by using Claim 5.2 instead of the triangle inequality:
5.2.5 Proof of Lemma 5.6
To prove this lemma, we bound the cost at each node u of tree in terms of the mistakes made at this node. For this lemma, we define to be the set of points in X that reach node u in T along with their center c(x). We note that
differs from
because a point
may not make it to
if there is a mistake later on (i.e.,
is the union of
only over leaf nodes).
Proof. Fix a coordinate ] and a node
. To simplify notation, we let
denote the sorted values of ith coordinate of the
centers that survive until node u (so that
and
). Observe that for each
, the center c(x) must have survived until node u, and hence,
of the values
We need a definition that allows us to relate the cost in coordinate i to the distances between For consecutive values (j, j + 1), we say that the pair (j, j + 1) is covered by x if either
• The segment [) is contained in the segment [
• The segment [ ) is contained in the segment [
We prove the following claim, which enables us to relate the cost in the ith coordinate to the value by decomposing this value into the distance between consecutive centers.
is covered by at least
Proof. Suppose for contradiction that this does not hold. We argue that we can find a threshold value for coordinate i that makes fewer than mistakes. To see this, assume that (j, j + 1) is covered by fewer than
points
. In particular, setting the threshold to be
separates fewer than
from their centers c(x). This implies that there are fewer than
mistakes at node u, which is a contradiction because the IMM algorithm chooses the coordinate and threshold pair that minimizes the number of mistakes.
Now this claim suffices to prove Lemma 5.7. The only challenge is that we must string together the covering points x to get a bound on
coordinate i to the cost of the given centers. Notice that the values partition the interval between
. Thus, each time x covers a pair (j, j + 1), there must be a contribution of
to the cost
. Because each pair is covered at least
times by Claim 5.8, we conclude that
To relate the bound to and
, we note that the above argument holds for each coordinate
], and we have that
For the k-means proof, we apply the same argument as above, this time using Claim 5.2 to bound the sum of squared values as
and therefore, summing over coordinates
Proof of Lemma 5.6. We start with the k-medians proof. The factor of H arises because the same points can appear in at most
is the depth of the tree. More precisely, using Lemma 5.7
for each node u, we have that
Applying the same steps for the k-means cost, we have that
5.3 Approximation lower bound
To complement our upper bounds, we show that a threshold tree with k leaves cannot, in general, yield better than an Ω(log k) approximation to the optimal k-medians or k-means clustering.
Theorem 5.9. For any 2, there exists a data set with k clusters such that any threshold tree T with k leaves must have k-medians and k-means cost at least
The data set is produced by first picking k random centers from the hypercube , for large enough d, and then using each of these to produce a cluster consisting of the d points that can be obtained by replacing one coordinate of the center by zero. Thus the clusters have size d and radius O(1). To prove the lower bound, we use ideas from the study of pseudo-random binary vectors, showing that projecting the centers to any subset of
coordinates take on all 2
possible values, with each occurring roughly equally often. Then, we show that (i) the threshold tree must be essentially a complete binary tree with depth Ω(log
achieve a clustering with low cost, and (ii) any such tree incurs a cost of Ω(log k) times more than the optimal for this data set (for both k-medians and k-means). The proof of Theorem 5.9 appears in Appendix A.
In this paper we discuss the capabilities and limitations of explainable clusters. For the special case of two clusters (k = 2), we provide nearly matching upper and lower bounds for a single threshold cut. For general k > 2, we present the IMM algorithm that achieves an O(H) approximation for k-medians and an O(Hk) approximation for k-means when the threshold tree has depth H and k leaves. We complement our upper bounds with a lower bound showing that any threshold tree with k leaves must have cost at least Ω(log k) more than the optimal for certain data sets. Our theoretical results provide the first approximation guarantees on the quality of explainable unsupervised learning in the context of clustering. Our work makes progress toward the larger goal of explainable AI methods with precise objectives and provable guarantees.
An immediate open direction is to improve our results for k clusters, either on the upper or lower bound side. One option is to use larger threshold trees with more than k leaves (or allowing more than k clusters). It is also an important goal to identify natural properties of the data that enable explainable, accurate clusters. For example, it would be interesting to improve our upper bounds on explainable clustering for well-separated data. Our lower bound of Ω(log k) utilizes clusters with diameter O(1) and separation Ω(d), where the hardness stems from the randomness of the centers. In this case, the approximation factor Θ(log k) is tight because our upper bound proof actually provides a bound in terms of the tree depth (which is about log k, see Appendix A.5). Therefore, an open question is whether a Θ(log k) approximation is possible for any well-separated clusters (e.g., mixture of Gaussians with separated means and small variance). Beyond k-medians/means, it would be worthwhile to develop other clustering methods using a small number of features (e.g., hierarchical clustering).
Acknowledgements. Sanjoy Dasgupta has been supported by NSF CCF-1813160. Nave Frost has been funded by the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (Grant agreement No. 804302). The contribution of Nave Frost is part of a Ph.D. thesis research conducted at Tel Aviv University.
[1] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15–28. Springer, 2009.
[2] Sara Ahmadian, Alessandro Epasto, Marina Knittel, Ravi Kumar, Mohammad Mahdian, Benjamin Moseley, Philip Pham, Sergei Vassilvtiskii, and Yuyan Wang. Fair hierarchical clustering. arXiv preprint arXiv:2006.10221, 2020.
[3] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics, pages 14–27. Springer, 2018.
[4] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine learning, 75(2):245–248, 2009.
[5] David Alvarez-Melis, Hal Daum´e III, Jennifer Wortman Vaughan, and Hanna Wallach. Weight of evidence as a basis for human-oriented explanations. arXiv preprint arXiv:1910.13503, 2019.
[6] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[7] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in neural information processing systems, pages 3216–3224, 2016.
[8] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In 31st International Symposium on Computational Geometry,
SoCG 2015, pages 754–767. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 2015.
[9] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405–413, 2019.
[10] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and KlausRobert M˜Aˇzller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
[11] Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the Johnson-Lindenstrauss lemma. In STOC, 2019.
[12] Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, pages 4955–4966, 2019.
[13] Dimitris Bertsimas, Agni Orfanoudaki, and Holly Wiberg. Interpretable clustering via optimal trees. arXiv preprint arXiv:1812.00539, 2018.
[14] Aditya Bhaskara and Aravinda Kanchana Rwanpathirana. Robust algorithms for online k-means clustering. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 148–173, 2020.
[15] Christos Boutsidis, Petros Drineas, and Michael W Mahoney. Unsupervised feature selection for the k-means clustering problem. In NIPS, pages 153–161, 2009.
[16] Ashish Chiplunkar, Sagar Kale, and Sivaramakrishnan Natarajan Ramamoorthy. How to solve fair k-center in massive data models. arXiv preprint arXiv:2002.07682, 2020.
[17] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In STOC, 2015.
[18] Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, and Guy Rom. Online k-means clustering. arXiv preprint arXiv:1909.06861, 2019.
[19] Sanjoy Dasgupta. The hardness of k-means clustering. University of California, San Diego (Technical Report), 2008.
[20] Daniel Deutch and Nave Frost. Constraints-based explanations of classifications. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 530–541. IEEE, 2019.
[21] Ricardo Fraiman, Badih Ghattas, and Marcela Svarc. Interpretable clustering using unsupervised binary trees. Advances in Data Analysis and Classification, 7(2):125–145, 2013.
[22] Damien Garreau and Ulrike von Luxburg. Explaining the explainer: A first theoretical analysis of lime. arXiv preprint arXiv:2001.03447, 2020.
[23] Pierre Geurts, Nizar Touleimat, Marie Dutreix, and Florence d’Alch´e Buc. Inferring biological networks with output kernel trees. BMC Bioinformatics, 8(2):S4, 2007.
[24] Badih Ghattas, Pierre Michel, and Laurent Boyer. Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods. Pattern Recognition, 67:177–185, 2017.
[25] Tom Hess and Sivan Sabato. Sequential no-substitution k-median-clustering. arXiv preprint arXiv:1905.12925, 2019.
[26] Lingxiao Huang, Shaofeng Jiang, and Nisheeth Vishnoi. Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, pages 7587–7598, 2019.
[27] Wasim Huleihel, Arya Mazumdar, Muriel M´edard, and Soumyabrata Pal. Same-cluster querying for overlapping clusters. In Advances in Neural Information Processing Systems, pages 10485–10495, 2019.
[28] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pages 10–18, 2002.
[29] Jacob Kauffmann, Malte Esders, Gr´egoire Montavon, Wojciech Samek, and Klaus-Robert M¨uller. From clustering to cluster explanations via neural networks. arXiv preprint arXiv:1906.07633, 2019.
[30] Matth¨aus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In International Conference on Machine Learning, pages 3448–3457, 2019.
[31] Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. An algorithm for online k-means clustering. In 2016 Proceedings of the eighteenth workshop on algorithm engineering and experiments (ALENEX), pages 81–89. SIAM, 2016.
[32] Zachary C Lipton. The mythos of model interpretability. Queue, 16(3):31–57, 2018.
[33] Bing Liu, Yiyuan Xia, and Philip S Yu. Clustering via decision tree construction. In Foundations and Advances in Data Mining, pages 97–124. Springer, 2005.
[34] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129– 137, 1982.
[35] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[36] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statist. Probability, pages 281–297, 1967.
[37] Sepideh Mahabadi and Ali Vakilian. (individual) fairness for k-clustering. arXiv preprint arXiv:2002.06742, 2020.
[38] Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In STOC, 2019.
[39] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems, pages 5788–5799, 2017.
[40] Christoph Molnar. Interpretable Machine Learning. Lulu. com, 2019. https://christophm.github. io/interpretable-ml-book/.
[41] Michal Moshkovitz. Unexpected effects of online k-means clustering. arXiv preprint arXiv:1908.06818, 2019.
[42] W James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592, 2019.
[43] Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM, 59(6):1–22, 2013.
[44] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
[45] J Ross Quinlan. C4. 5: programs for machine learning. Elsevier, 2014.
[46] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[47] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[48] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[49] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In International Workshop on Approximation and Online Algorithms, pages 232–251. Springer, 2019.
[50] Hinrich Sch¨utze, Christopher D Manning, and Prabhakar Raghavan. Introduction to information retrieval. In Proceedings of the International Communication of Association for Computing Machinery Conference, volume 4, 2008.
[51] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
[52] Kacper Sokol and Peter Flach. Limetree: Interactively customisable explanations based on local surrogate multi-output regression trees. arXiv preprint arXiv:2005.01427, 2020.
[53] H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci, 1:801–804, 1956.
Appendix Organization:
• Section A contains the Ω(log k) lower bound for tree-based clustering using k leaves.
• Section B provides improved lower bounds for 2-medians or 2-means for a tree with two leaves.
• Section C contains the remaining upper bound proof exhibiting a 4-approximation for 2-means.
• Section D contains details about improving the running time of the algorithms.
In this section we show that any threshold tree with k leaves must be an Ω(log k)-approximation, under the k-means and k-medians cost. We will show a data set that will cause many mistakes. This data set consists of k clusters where any two clusters are very far from each other while inside any cluster the points differ by at most two features. Each cluster is created by first taking a codeword and then changing one feature at a time to 0. The consequence of this process is that for every feature there are many points that globally are very different yet locally all equal to 0.
The proof of the lower bound has a few steps:
1. In Section A.1 we show that there is a code such that (i) every two points are far apart, and (ii) when inspecting any O(log k) features, many codewords are consistent with this local view. From thiscode we construct our data set with dk points and cost(opt) = O(dk).
2. In Section A.2 we prove that the clusters induced by the threshold tree T are similar to the original clusters, except for at most k points in each cluster. These points will cause cost(T) to be large.
3. In Section A.3 we uncover a few properties of any threshold tree created by an O(log k)-approximation algorithm: up until level O(log k) the tree has to be complete and no feature is used more than once.
4. In Section A.4 we put together all the claims and show that each level causes Ω(k log k) mistakes, each with a cost of Ω() which proves the lower bound of Ω(log k)-approximation.
Data set construction. We first take that have the properties described in Claim A.2. From each codeword v we create d data points,
, each time by changing exactly one feature to 0. In total we have dk points in the data set,
. The cost of the clustering that cluster together all points that belong to the same vector
), as the cost of each point is Θ(1). Thus, cost(
A.1 The data set
Proposition A.1 (Hoeffdings inequality)be independent random variables, where for each i,
[0, 1]. Define the random variable X =
Then, for any
we have Pr(
2
, there are
that have the following properties for any
2. for every their distance is linear, i.e.,
3. for every indexes in [d], and every assignment to these indexes, the number of points in C that has these assignment is at least
Proof. Take k random points in . We will show that the probability that all properties hold is bigger than 0 and this will prove our claim using the probabilistic method. To prove the second property, we use Hoeffding’s inequality and union bound. We can bound the probability that any two points in C agree by more than 3d/4 coordinates by 2
To prove the third property we again use Hoeffding’s inequality and union bound. This time though we have k random variables, one for each point. There are
coordinates, and there are 2
assignments to these coordinates. For specific
coordinates and an assignment to these coordinates the expected number of points in C that has the specific assignment is
. By Hoeffding’s inequality, the probability that we deviate by
is less than
. The probability that the last property does not hold is bounded by
A.2 The cluster created by a threshold tree
Claim A.3. For any threshold tree T with at most k leaves, and for any codeword v, the leaf containing v also contains at least
Proof. There are at most 3 leaves in T, thus in the root to leaf path of the codeword v there are at most
1 features. Hence, all data points in
that agree on this features must reach the same leaf and be in the same cluster. There are at least
such points in
Claim A.4. If there are points from
points from
, that are in the same cluster in T, then their contribution to cost(T) is at least
The claim holds both under the
cost and the
squared cost.
Proof. Denote the center that contains points from
and
points from
by
Without loss of generality
We can disjointly match
points from the two different clusters (
), which means that their contribution to cost(T) is at least
where the first inequality follows Claim 5.2, the second inequality follows from Claim A.2 and the fact that if two codewords are different by at least d/4 features, then the points differ by at least 2 features, each contributing a cost of 4, the third inequality follows from the fact that
16 for
Similarly for the
A.3 The threshold tree
The next two claims prove that if a feature is used twice or the tree is not complete until level clustering tree T cannot be an O(log k)-approximation because it shows that cost(
Claim A.5. Fix a threshold tree T with 3 leaves. If there is a feature that is used twice on the same root-to-leaf path in T, then
Proof. The proof is composed of two steps, first we show that if a feature is used twice then there is leaf that it is unreachable by a codeword. Then we will show that this implies that two codewords share the same cluster, and thus cost(T) is high.
Assume that the there are two nodes in T, both of them use the same feature i, one with threshold the other with threshold
then there is a leaf that is not reachable. Otherwise, the two thresholds divide the line into three parts, and since the codewords have only two values, there is a leaf unreachable by any codeword. Summing up these two cases, there is a leaf that is not reached by any codeword.
From the pigeonhole principle there are two codewords that share the same cluster which is a contradiction using Claims A.3 and A.4.
Claim A.6. If threshold T contains a leaf at depth less than
The claim is true both under the k-means and the k-medians cost.
Proof. Assume T is not complete until level . So there is a leaf at a level smaller than
. By the construction of the data set, there are at least (
1 codewords that reach this leaf. The claim follows from Claims A.3 and A.4.
A.4 Proof of Theorem 5.9
Assume by contradiction that T is an O(log k)-approximation. From Claim A.3 we deduce that for each codeword points from
will be in the same cluster. From Claim A.4 and the assumption that T is O(log k)-approximation we get that each
such points must be in its own cluster, this cluster will be called the
of generality, that each threshold is either 0Focus on some node in
with feature i and threshold
5, then for all codewords
= 1 the point in
will be separated for its main cluster. From the construction of the data set, there are at least (
such points. Similarly for
we can show that there are at least (
such points. From Claim A.5, we deduce that these mistakes are disjoint.
i.e., points that will not go with their codeword, can be lower bounded by the following using =
and large enough k:
Thus, from Claim A.3 we can lower bound the cost of T:
A.5 IMM Upper Bound for this dataset
We sketch the proof that the IMM algorithm produces a tree of depth O(log k) for the above dataset construction with high probability. In particular, the upper bound from Theorem 5.1 is tight for k-medians up to the leading constant for this dataset.
The analysis will follow the standard bound on the maximum clique size in a random graph. Consider fixing any = 3 log
coordinates to
1. When the set of k centers C is chosen uniformly at random from
and
, we show that with high probability there are at most
centers consistent with these values. When IMM builds the tree, it always chooses a threshold that reduces the number of centers in the children of the current node, and hence, it never splits on the same feature twice. Moreover, it stops the recursion when there is a single center in a leaf. Therefore, after 3 log
thresholds, the remaining depth of the tree is at most 3 log
, and hence, the total depth of the tree is at most 6 log
More formally, let be any sign pattern, and let
be set of centers having pattern
when projected onto coordinates
] with
. Then, using the standard upper bound on the binomial coefficient, we have
Therefore, plugging in 2, we see that this probability is at most (
. Taking a union bound over the 2
possible settings of
shows that the probability that there are
centers consistent with any
tends to zero as k increases.
Without loss of generality we can assume that We use the following dataset for both 2-medians and 2-means. It consists of 2d points, partitioned into two clusters of size d, which are the points with Hamming distance exactly one from the vector with all 1 entries and the vector with all
Let ) be the best threshold cut.
2-medians lower bound. The cost of the cluster with centers (1, . . . , 1) and (1) is 2d, as each point is responsible for a cost of 1. Thus, cost(
There is a coordinate i and a threshold that defines the cut
. For any coordinate i, there are only three possible values:
is either in (
0) or in (0, 1). Without loss of generality, assume that
= 1. Thus, the cut is composed of two clusters: one of size
1 and the other of size d + 1, in the following way:
Using Fact 5.4, an optimal center of the first cluster is all 1, and the optimal center for the second cluster is all 1. The cost of the first cluster is
1, as each point costs 1. The cost of the second cluster is composed of two terms d for all points that include 1 in at least one coordinate and the cost of point (0
1) + 1. So the total cost is 4
2. Thus cost(
2-means lower bound. Focus on the clustering with centers
The cost of each point in the data is composed of (1) one coordinate with value zero, and the cost of this coordinate is (1 coordinates each with cost
Thus, each point has a cost of
Thus, the total cost is
1). This implies that cost(
resulting clusters are as in the case of 2-medians. The optimal centers are (see Fact 5.3):
We want to lower bound cost( We start with the cost of the first cluster, i.e.
. To do so for each point in
, we will evaluate the contribution of each coordinate to the cost (1) the first coordinate adds 0 to the cost (2) the coordinate with value 0, adds
to the cost (3) the rest of the
2 coordinates adds
Thus, each point in
adds to the cost
1 points, its total cost is
Moving on to evaluating the cost of , the cost of the point (0
1) is composed of two terms (1) the first coordinate adds
to the cost (2) each of the other
1 coordinates adds
to the cost. Thus, this point adds
Similarly, the point (0, 1, . . . , 1) adds to the cost
Finally, each of the 1 remaining points in
adds to the cost
Thus, the cost of
Summing up the costs of
We show that there is a threshold cut with 2-means cost satisfying cost(
We could just use the same proof idea as in the 2-medians case that first applies Lemma 5.5 and then uses the matching result, Lemma 4.3. This leads to a 6-approximation, instead of 4. The reason is that we apply twice Claim 5.2, which is not tight. Improving the approximation to 4 requires us to apply Claim 5.2 only once.
Suppose are optimal 2-means centers for the clusters
be the minimum number of changes for any threshold cut
, and define
to the set of t points in the symmetric difference, where
Using the same argument as in the proof of Lemma 5.5, we have
The goal now is to bound the latter two terms using cost(opt). This term measures the distance of each from the “other” center, i.e., not c(x).
Claim C.1.
Using Claim C.1, together with Inequality (1) we have
and this completes the proof.
Proof of Claim C.1. Denote the Assume that the first
points are in the first optimal cluster,
, and the rest are in the second cluster,
Applying Lemma 4.3 for each coordinate ] guarantees t pairs of vectors (
) with the following properties. Each
corresponds to the ith coordinate of some point in
corresponds to the ith coordinate of some point in
. Furthermore, for each coordinate, the t pairs correspond to 2t distinct points in X. Finally, we can assume without loss of generality that
For each point
in the first
points in
, if
then we can replace
with
, thus we canassume without loss of generality that
We next show that cost(opt) is lower bounded by a function
of t. There will be two cases depending on whether or not. The harder case is the first where the improvement of the approximation from 6 to 4 arises. Instead of first bounding the distance between
and its new center using the distance to its original center and then accounting for
we directly account for the distance between
and its new center.
The last inequality follows from and
which imply that (
) + (
)
which means ((
Similarly for each point in the last
Putting these together we have
D.1 The 2-means case
The psudo-code for finding the best threshold for k = 2 depicted in Algorithm 2. In time O(d) we can calculate cost(p + 1) and the new centers by using the value cost(p) and the previous centers. Throughout the computation we save in memory
1. Two vectors
2. Scalar
We also make use of the identity:
This identity is correct because
By invoking this identity, we can quickly compute the cost of placing the first p points in cluster one and the last points in cluster two. Each such partition can be achieved by using a threshold
and
. Our algorithm computes these costs for each feature
]. Then, we output the feature i and threshold
that minimizes the cost. This guarantees that we find the best possible threshold cut.
Overall, Algorithm 2 iterates over the d features, and for each feature it sorts the n vectors according to their values in the current feature. Next, the algorithm iterates over the n vectors and for each potential threshold, it calculates the cost by evaluating the inner product of two d-dimensional vectors. Overall its runtime complexity is
D.2 The 2-medians case
The high level idea of a finding an optimal 2-medians cut is similar to the 2-means algorithm. The algorithm goes over all possible thresholds. For each threshold, it finds the optimal centers and calculates the cost accordingly. Then, it outputs the threshold cut that minimizes the 2-medians cost.
Updating cost. To update the cost we need to show how to express cost(p + 1) in terms of cost(p). We know that cost(p + 1) is equal to
For every feature ], there are
1 thresholds to consider. After sorting by this feature, we can consider all splits into
contains the p smallest points, and
contains the
largest points. We increase p from p = 1 to
1, computing the clusters and cost at each step. If p is odd then the median of
(i.e., the optimal center of
) does not change compared to
1. The only contribution to the cost is the point x that moved from
to
. If p is even, then at each coordinate there are two cases,
depending on whether the median changes or not. If it changes, then let ∆ denote the change in cost of the points in that are smaller than the median. By symmetry, the change in the cost of the points that are larger is
∆. Thus, the change of the cost is balanced by the points that are larger and smaller than the median. Similar reasoning holds for the other cluster
Therefore, we conclude that moving
changes the cost by exactly
. Thus, we have the following connection between cost(p + 1) and cost(p):
Updating centers. For each p, the cost update relies on efficient calculations of the centers ) and
+ 1). The centers
) are the medians of the clusters at the pth threshold. Note that moving from the pth thresold to the (p + 1)th will only change the clusters by moving one vector from one cluster to the other. We can determine the changes efficiently by using d arrays, one for each coordinate. Each array will contain (pointers to) the input vectors X sorted by their ith feature value. As we move the threshold along a single coordinate, we can read off the partition into two clusters, and we can compute the median of each cluster by considering the midpoint in the sorted list.
Overall, this procedure computes the cost of each threshold, while also determining the partition into two clusters and their centers (medians). The time is O(nd log n) to sort by each feature, and ) to compute cost(p) for each
] and each feature. Therefore, the total time for the 2-medians algorithm is