b

DiscoverSearch
About
My stuff
Condorcet's Jury Theorem for Consensus Clustering and its Implications for Diversity
2016·arXiv
Abstract
Abstract

Condorcet’s Jury Theorem has been invoked for ensemble classifiers to indicate that the combination of many classifiers can have better predictive performance than a single classifier. Such a theoretical underpinning is unknown for consensus clustering. This article extends Condorcet’s Jury Theorem to the mean partition approach under the additional assumptions that a unique ground-truth partition exists and sample partitions are drawn from a sufficiently small ball containing the ground-truth. As an implication of practical relevance, we question the claim that the quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partitions is necessary for controlling the quality.

Ensemble learning generates multiple models and combines them to a single consensus model to solve a learning problem. The assumption is that a consensus model performs better than an individual model or at least reduces the likelihood of selecting a model with inferior performance [37]. Examples of ensemble learning are classifier ensembles [6, 30, 39, 48] and cluster ensembles (consensus clustering) [15, 40, 44, 47].

The assumptions on ensemble learning follow the idea of collective wisdom that many heads are in general better than one. The idea of group intelligence applied to societies can be traced back to Aristotle and the philosophers of antiquity (see [45]) and has been recently revived by a number of publications, including James Surowiecki’s book The Wisdom of Crowds [41]. In his book, Surowiecki argues that not all crowds are wise, but in order to become wise, the crowd should comply to diversity of opinion and other criteria.

image

One theoretical basis for collective wisdom can be derived from Condorcet’s Jury Theorem [4]. The theorem refers to a jury of n voters that need to reach a decision by majority vote. The assumptions of the simplest version of the theorem are: (1) There are two alternatives; (2) one of both alternatives is correct; (3) voters decide independently; and (4) the probability p of a correct decision is identical for every voter. If the voters are competent, that is p > 0.5, then Condorcet’s Jury Theorem states that the probability of a correct decision by majority vote tends to one as the number n of voters increases to infinity.

Condorcet’s Jury Theorem has been generalized in several ways, because its assumptions are considered as rather restrictive and partly unrealistic (see e.g. [1] and references therein). Despite its practical limitations, the theorem has been used to indicate a theoretical justification of ensemble classifiers [30, 32, 39]. In contrast to ensemble classifiers, such a theoretical underpinning is unknown for consensus clustering.

image

Though the importance of diversity for classifier ensembles has been recognized long before Surowiecki’s book [29, 30], some authors and scholars additionally employ Surowiecki’s argument on diversity in retrospect for ensemble learning [39]. Inspired by the success of ensemble classifiers, diversity has also been promoted as a key factor for improving the quality of consensus clustering [18, 10, 19, 21, 31, 34, 36, 40, 44, 47]. However, a sound and consistent relationship between diversity and quality of consensus clustering could not yet be established. Instead, different empirical studies on the role of diversity in consensus clustering led to contradictory results [36].

image

This article addresses both problems: (1) the lack of a theoretical underpinning of consensus clustering with regard to the paradigm of collective wisdom; and (2) the lack of a consistent relationship between diversity and quality.

With regard to the first problem, we extend Condorcet’s Jury Theorem to the mean partition approach [5, 7, 11, 12, 16, 33, 40, 42, 43]. Here, we consider the special case that the partition space is endowed with a metric induced by the Euclidean norm. Then the proposed theorem draws on the following assumptions: (1) there is a unique (possibly unknown) ground-truth partition  X∗; and (2) sample partitions are drawn i.i.d. from a sufficiently small ball containing  X∗.

With regard to the second problem, this article analyzes the role of diversity in light of Condorcet’s Jury Theorem within a special problem setting. We question the claim that the quality of consensus clustering depends on the diversity of the sample partitions generated by the cluster ensemble. Instead, we conjecture that limiting the diversity of the mean partitions is necessary to control the quality. This finding would explain the contradictory results of previous empirical studies on diversity as outlined and discussed in [36].

image

The rest of this paper is structured as follows: Section 2 introduces background material. In Section 3 we present the extended version of Condorcet’s Jury Theorem. Section 4 discusses the role of diversity and Section 5 concludes. Proofs are delegated to the appendix.

2.1. The Mean Partition Approach

The goal is to group a set  Z = {z1, . . . , zm}of m data points into  ℓclusters. The mean partition approach first clusters the same data set Z several times using differ-ent settings and strategies of the same or different cluster algorithms. The resulting clusterings form a sample  Sn= (X1, . . . , Xn) of n partitions  Xi ∈ Pof data set Z. The mean partition approach aims at finding a consensus clustering that minimizes a sum-of-distances criterion from the sample partitions. In Section 3, we specify the underlying partition space and in Section 3.3 we present a formal definition of the mean partition approach.

2.2. Context of the Mean Partition Approach

We place the mean partition approach into the broader context of mathematical statistics. The motivation is that mathematical statistics offers a plethora of useful results, the consensus clustering literature seems to be unaware of. For example, the proof of Condorcet’s Jury Theorem rests on results from statistical analysis of graphs [25]. These results in turn are rooted on Fr´echet’s seminal monograph [13] and its follow-up research.

Since a meaningful addition of partitions is unknown, the mean partition approach emulates an averaging procedure by minimizing a sum-of-distances criterion. This idea is not new and has been studied in more general form for almost seven decades. In 1948, Fr´echet first generalized the idea of averaging in metric spaces, where a well-defined addition is unknown. He showed that specification of a metric and a probability distribution is sufficient to define a mean element as measure of central tendency. The mean of a sample of elements is any element that minimizes the sum of squared distances from all sample elements. Similarly, the expectation of a probability distribution minimizes an integral of the sum of squared distances from all elements of the entire space.

Since Fr´echet’s seminal work, mathematical statistics studied asymptotic and other properties of the mean element in abstract metric spaces. Examples include statistical analysis of shapes [2, 8, 20, 28], complex objects [35, 46], tree-structured data [9, 46], and graphs [14, 25].

The partition spaces defined in Section 3.2 can be regarded as a special case of graph spaces [22, 23]. Consequently, the geometric as well as statistical properties of graph spaces carry over to partition spaces. The proof of the proposed theorem rests on the orbit space framework [22, 23], on the mean partition theorem in graph spaces, and on asymptotic properties of the sample mean of graphs [25] that have been adopted to partition spaces [24, 27].

This section first introduces partition spaces endowed with a metric induced by the Euclidean norm. Then we formalize the mean partition approach using Fr´echet functions.

Throughout this contribution, we assume that  Z = {z1, . . . , zm}is a set of m data points to be clustered and  C = {c1, . . . , cℓ}is a set of  ℓcluster labels.

3.1. Partitions and their Representations

Partitions usually occur in two forms, in a labeled and in an unlabeled form, where labeled partitions can be regarded as representations of unlabeled partitions.

We begin with describing labeled partitions. Let  1d ∈ Rddenote the vector of all ones. Consider the set

image

of matrices with elements from the unit interval and whose columns sum to one. A matrix  X ∈ Xrepresents a labeled (soft) partition of Z. The elements  xkjof X = (xkj) describe the degree of membership of data point  zjto the cluster with label  ck. The columns  x:jof X summarize the membership values of the data points zjacross all  ℓclusters. The rows  xk:of X represent the clusters  ck.

Next, we describe unlabeled partitions. Observe that the rows of a labeled partition X describe a cluster structure. Permuting the rows of X results in a labeled partition X′with the same cluster structure but with a possibly different labeling of the clusters. In cluster analysis, the particular labeling of the clusters is usually meaningless. What matters is the abstract cluster structure represented by a labeled partition. Since there is no natural labeling of the clusters, we define the corresponding unlabeled partition as the equivalence class of all labeled partitions that can be obtained from one another by relabeling the clusters. Formally, an unlabeled partition is a set of the form

image

where Πℓis the set of all (ℓ × ℓ)-permutation matrices.

In the following, we briefly call X a partition instead of unlabeled partition. In addition, any labeled partition  X′ ∈ Xis called a representation of partition X. By P we denote the set of all (unlabeled) partitions with  ℓclusters over m data points. Since some clusters may be empty, the set P also contains partitions with less than  ℓclusters. Thus, we consider  ℓ ≤ mas the maximum number of clusters we encounter.

A  hard partition X ∈ Pis a partition whose matrix representations take only binary membership values from {0, 1}. By  P+we denote the subset of all hard partitions. Note that the columns of representations of hard partitions are standard basis vectors from  Rℓ.

Though we are only interested in unlabeled partitions, we still need labeled partitions for two reasons: (1) computers can not easily and efficiently cope with unlabeled partitions unless the clusters carry labels in terms of number or names; and (2) using labeled partitions considerably simplifies derivation of theoretical results.

3.2. Intrinsic Metric

We endow the set P of partitions with an intrinsic metric  δinduced by the Euclidean norm such that (P, δ) becomes a geodesic space. The Euclidean norm for matrices X ∈ Xis defined by

image

The norm  ∥X∥is also known as the Frobenius or Schur norm. We call  ∥X∥Euclidean norm in order to emphasize the geometric properties of the partition space. The Euclidean norm induces the distance function

image

for all partitions  X, Y ∈ P. Then the pair (P, δ) is a geodesic metric space [24], Theorem 2.1. Suppose that X and Y are two partitions. Then

image

for all representations  X ∈ Xand  Y ∈ Y. For some pairs of representations  X′ ∈ Xand  Y ′ ∈ Yequality holds in Eq. (1). In this case, we say that representations  X′and  Y ′are in optimal position. Note that pairs of representations in optimal position are not uniquely determined.

3.3. Fr´echet Functions

We first formalize the mean partition approach using Fr´echet functions. Then we present the Mean Partition Theorem, which is of pivotal importance for gaining deeper insight into the theory of the mean partition approach [27]. Here, we apply the Mean Partition Theorem to define the concept of majority vote. In addition, the proof of the proposed theorem resorts to the properties stated in the Mean Partition Theorem.

Let (P, δ) be a partition space endowed with the metric  δinduced by the Euclidean norm. We assume that Q is a probability dsitribution on P with support  SQ.1Suppose that  Sn= (X1, X2, . . . , Xn) is a sample of n partitions  Xidrawn i.i.d. from the

probability distribution Q. Then the Fr´echet function of  Snis of the form

image

A mean partition of sample  Snis any partition  M ∈ Psatisfying

image

Note that a mean partition needs not to be a member of the support. In addition, a mean partition exists but is not unique, in general [24].

The Mean Partition Theorem proved in [27] states that any representation M of a local minimum M of  Fnis the standard mean of sample representations in optimal position with M.

Theorem 1. Let  Sn= (X1, . . . , Xn) ∈ Pnbe a sample of n partitions. Suppose that  M ∈ Pis a local minimum of the Fr´echet function  Fn(Z) of  Sn. Then every representation M of M is of the form

image

Condorcet’s original theorem is an asymptotical statement about the majority vote. To adopt this statement, we introduce the notion of expected partition. An expected partition of probability distribution Q is any partition  MQ ∈ Pthat minimizes the expected Fr´echet function

image

As for the sample Fr´echet function  Fn, the minimum of the expected Fr´echet function FQexists but but is not unique, in general [24].

This section extends Condorcet’s Jury Theorem to the partition space defined in Section 3.2.

4.1. The General Setting

Theorem 2 extends Condorcet’s Jury Theorem for hard partitions. Generalization to arbitrary partitions is out of scope and left for future research.

The general setting of Theorem 2 is as follows: Let  Sn= (X1, . . . , Xn) be a sample of n hard partitions  Xi ∈ P+drawn i.i.d. from a probability distribution Q. Each of the sample partitions  Xihas a vote on a given data point  z ∈ Zwith probability  pi(z) of being correct. The goal is to reach a final decision on data point z by majority vote. Theorem 2 makes an asymptotic statement about the correctness of the majority vote given the probabilities  pi.

To formulate Theorem 2, we need to define the concepts of vote and majority vote. The majority vote is based on the mean partition of a sample and is not necessarily a hard partition. Since the mean partition itself votes, we introduce votes for arbitrary (soft and hard) partitions and later restrict ourselves to samples of hard partitions when defining the majority vote.

Assumption. In the following, we assume existence of a possibly unknown but unique hard ground-truth partition  X∗ ∈ P+. By  X∗we denote an arbitrarily selected but fixed representation of  X∗.

4.2. Votes

We model the vote of a partition  X ∈ Pon a given data point  z ∈ Z. The vote of X on z has two possible outcomes: The vote is correct if X agrees on z with the ground-truth  X∗, and the vote is wrong otherwise. To model the vote of a partition, we need to specify what we mean by agreeing on a data-point with the ground-truth. An agreement function of representation X of X is a function of the form

image

where  x:jand  x∗:jare the j-th columns of the representations X and  X∗, respectively. A column of a matrix represents the membership values of the corresponding data point across all clusters. Then the value  kX(zj) measures how strongly representation X agrees with the ground-truth  X∗on data point  zj. If X is a hard partition, then kX(z) = 1 if z occurs in the same cluster of X and  X∗, and  kX(z) = 0 otherwise.

The vote of representation X of partition X on data point z is defined by

image

where I {b} is the indicator function that gives 1 if the boolean expression b is true, and 0 otherwise. Observe that  kX = VXfor hard partitions  X ∈ P+.

Based on the vote of a representation we can define the vote of a partition. The vote of partition is a Bernoulli distributed random variable. We randomly select a representation X of partition X in optimal position with  X∗. Then the vote  VX(z) of X on data point z is  VX(z). By

image

we denote the probability of a correct vote of partition X on data point z. Note that the probability  pX(z) is independent of the particular choice of representation  X∗of the ground-truth partition  X∗.

4.3. Majority Vote

We assume that  Sn= (X1, . . . , Xn) is a sample of n hard partitions  Xi ∈ P+drawn i.i.d. from a cluster ensemble. We define a majority vote  Vn(z) of sample  Snon z as follows: We randomly select a mean partition M of  Snand then set the majority vote Vn(z) on z to the vote  VM(z) of the chosen mean partition  M.2

It remains to show that the vote  VM(z) of any mean partition M of  Snis indeed a majority vote. To see this, we invoke the Mean Partition Theorem. Any representation M of mean partition M is of the form

image

where  Xi ∈ Xiare representations in optimal position with M. For a given data point  zj ∈ Z, the mean membership values are given by

image

where  x(i):jdenotes the j-th column of representation  Xi. Since the columns of  x(i):jare standard basis vectors, the elements  mkjof the j-th column  m:jcontain the relative frequencies with which data point  zjoccurs in cluster  ck. Then the vote  VM(zj) is correct if and only if the agreement function of M satisfies

image

This in turn implies that there is a majority  mkj > 0.5 for some cluster  ck, because X∗is a hard partition by assumption.

4.4. Condorcet’s Jury Theorem

Roughly, Condorcet’s Jury Theorem states that the majority vote tends to be correct when the individual voters are independent and competent. In consensus clustering, the majority vote is based on mean partitions. Individual sample partitions  Xiare competent on data point  z ∈ Zif the probability of a correct vote on z is given by  pi(z) > 0.5. In the spirit of Condorcet’s Jury Theorem, we want to show that the probability  P(hn(z) = 1) of the majority vote  hn(z) tends to one with increasing sample size n.

In general, mean partitions are neither unique nor converge to a unique expected partition. This in turn may result in a non-convergent sequence (hn(z))n∈Nof majority votes for a given data points z. In this case, it is not possible to establish convergence in probability to the ground-truth. To cope with this problem, we demand that the sample partitions are all contained in a sufficiently small ball, called asymmetry ball. The  asymmetry ball AZof partition  Z ∈ Pis the subset of the form

image

where  αZis the degree of asymmetry of Z defined by

image

A partition Z is asymmetric if  αZ >0. If  αZ= 0 the partition Z is called symmetric. Any partition whose representations have mutually distinct rows is an asymmetric partition. Conversely, a partition is symmetric if it has a representation with at least two identical rows. We refer to [26] for more details on asymmetric partitions.

By  A◦Zwe denote the largest open subset of  AZ. If Z is symmetric, then  A◦Z=  ∅be definition. Thus, a non-empty set  A◦Zentails that Z is symmetric.

A probability distribution Q is homogeneous if there is a partition Z such that the support  SQof probability distribution Q is contained in the asymmetry ball  A◦Z. A sample  Snis said to be homogeneous if the sample partitions of  Snare drawn from a homogeneous distribution Q.

Now we are in the position to present Condorcet’s Jury Theorem for the mean partition approach. For a proof we refer to the appendix.

Theorem 2 (Condorcet’s Jury Theorem). Let Q be a probability measure on  P+with support  SQ. Suppose the following assumptions hold:

1. There is a partition  Z ∈ Psuch that  X∗ ∈ A◦Zand  SQ ⊆ A◦Z.

2. Hard partitions  X1, . . . , Xn ∈ P+are drawn i.i.d. according to Q.

3. Let  z ∈ Z. Then  pz = pX(z) is constant for all  X ∈ SQ.

image

Equation (2) corresponds to Condorcet’s original theorem for majority vote on a single data point and Eq. (3) shows that the sequence of mean partitions converges almost surely to the ground-truth partition. Observe that almost sure convergence in Eq. (3) also holds when the probabilities  pzdiffer for different data points  z ∈ Z. From the proof of Condorcet’s Jury Theorem follows that the ground-truth partition X∗is an expected partition almost surely and therefore takes the form as described in the Expected Partition Theorem [27].

This section discusses the role of diversity under the assumptions of a special setting. We question the prevailing claim that quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partition set is necessary to control the quality.

5.1. Diversity and Quality

This section specifies the notion of diversity and quality for the subsequent analysis.

image

Suppose that S is either a sample or a set consisting of n elements  X1, . . . , Xn. There are two common approaches to measure the diversity of S. Both approaches assume a dissimilarity function ∆ :  P ×P → R. Typical choices for ∆ are based on the adjusted Rand index, the Jaccard index, and the Normalized Mutual Information. The first approach averages the pairwise dissimilarities

image

and the second approach is the variation

image

where M is either a medoid or mean partition of S. A simple calculation shows that both diversity measures are closely related by the inequality

image

For the following analysis, we assume that ∆ =  δ2. In this case, the diversity of a homogeneous sample (set) S is bounded by

image

where  αZis the degree of asymmetry of a partition Z whose asymmetry ball  AZincludes S. We say, a sample (set) is diverse if it is not homogeneous.

image

Let  X∗be the ground-truth partition to be predicted as good as possible. For this, we need to specify the quality of a prediction, that is what we exactly mean by the term ”as good as possible”. We introduce the loss L(X) =  δ(X, X∗) of a partition X as a

measure of how well X predicts the ground-truth  X∗.

To measure the quality of consensus clustering, we assume that  S ⊆ Pis a sample

image

is the worst-case loss of S. Consensus clustering generates a sample  Snof sample partitions. The sample  Sndetermines the non-empty and finite set  Fnof mean partitions. A mean algorithm picks a mean partition M from  Fnto predict the ground-truth X∗. We define the quality of consensus clustering by the worst-case loss  L∗(Fn). We consider the worst-case loss rather than the average or expected loss to keep the subsequent analysis simple.

5.2. The Role of Diversity

Diversity of the sample partitions is a basic prerequisite for consensus clustering. Without diversity, consensus clustering reproduces the same partition. In this case, the resulting mean partition is merely another replication.

Research on diversity in consensus clustering assumes that the quality of consensus clustering depends on the diversity of the sample. However, there is a significant gap in argument between the fact that diversity is a basic prerequisite to make sense of consensus clustering and the assumption that quality depends on diversity. The hope is that this gap can be bridged by an appropriate but currently unknown diversity measure. The proof of Theorem 2 rejects this hope.

Theorem 2 includes the assumptions of Condorcet’s original theorem and adds two further assumptions: (1) existence of a unique ground-truth partition, and (2) existence of a sufficiently small ball that contains the ground-truth partition as well as any sample partition.

The first assumption transfers the notion of correct decision in Condorcet’s original formulation to the context of consensus clustering. The second assumption demands homogeneity of the probability measure. At first glance, homogeneity seems to contradict the assumption that diversity of the sample is a decisive factor for improving the quality of consensus clustering.

A closer look at the proof of Theorem 2 shows that neither homogeneity nor diversity of the sample matters. What matters is that the mean partition is a consistent estimator of the ground-truth. Homogeneous probability distributions merely form one among several other classes of distributions for which consistency of the mean partition to a unique expected partition is ensured. Diversity of the sample can entail both, the desired consistency properties as well as diversity of the mean and expected partitions. In the latter case, Theorem 2 fails and the mean partition returned by the consensus clustering method may have nothing in common with the ground-truth.

The claim is that diversity of the sample partitions is neither necessary nor sufficient for finding mean partitions close to the ground-truth, whereas limiting the diversity of the mean partition set is necessary but not sufficient to control the quality of consensus clustering.

5.3. Loss Decomposition

Inspired by the bias-variance decomposition in supervised learning, we discuss the different sources of error in consensus clustering to understand how we can improve its quality. For this, we call a cluster ensemble stable (unstable) if its mean partition set is homogeneous (diverse). Suppose that  Fn ⊆ Pis a mean partition set. Then

image

is the approximation error of F. The approximation error tells us how close the best solution we can attain is to the ground-truth. Then the worst-case loss  L∗(Fn) of a mean partition set  Fncan be expressed as

image

The approximation error is caused by the choice and design of the underlying clustering ensemble method. The estimation error is caused by the random process of generating partitions. This error measures the difference between the loss in the worst and best case. If the set  Fnof mean partitions is homogeneous (diverse), then the estimation error is small (large).

For homogeneous distributions, the set  Fnis a singleton. Therefore, the estimation error is always zero, but the approximation error is likely to be large, when prior knowledge about the data is not considered. More generally, stable approaches guarantee a small estimation error and thereby impose a strong bias about the kind of cluster structure the underlying ensemble method is looking for. If the assumption on the cluster structure does not match the ground-truth, the approximation error will be large. The usual way to improve stable approaches is to incorporate competence by means of prior knowledge and domain expertise in order to hopefully reduce the approximation error. Under the special assumptions of Theorem 2, the estimation error is zero and the worst-case loss  L∗(Mn) converges almost surely to zero as the sample size n tends to infinity.

In contrast, unstable approaches result in large estimation errors, but may more likely have a lower approximation error than stable approaches. Such approaches are less biased about the kind of cluster structure the underlying ensemble method is looking for. Unstable approaches shift the problem of assuming a certain kind of cluster structure in the data to the problem of identifying a mean partition close to the ground-truth. In other words, the bias on the kind of cluster structure is shifted to a bias on the choice of mean partition. The problem is that existing mean algorithms are unable to cope with the latter bias, because the Fr´echet function contains no information about the ground-truth. Consequently, a mean algorithm regards the diverse sample means as equivalent solutions of the same minimization problem. Thus, obtaining a mean partition with low loss is merely due to chance.

Another issue is that unstable approaches are prone to the Texas sharpshooter fallacy: Suppose a data set Z is given for which the ground truth is known. Repeatedly apply a consensus clustering method to Z. In each trial, consensus clustering generates sample partitions and returns a mean partition. The mean partitions of the different trials can differ substantially. In this case, select the top N mean partitions closest to the known ground-truth and claim that diversity improves quality.

Finally, it is unclear how to improve the quality of unstable approaches without turning them into stable versions. A possible solution could be to derive the mean partitions of the second order, that is the mean partitions of the mean partition set. If the mean partition set is diverse, it can happen that mean partition sets of higher order remain diverse. Empirical results could shed light on this issue but require enumerating a subset of mean partitions, which is computationally expensive. Also incorporating prior knowledge will not improve the quality of unstable approaches, because the mean partitions are diverse.

Though we analyzed the relationship between diversity and quality in a special setting, we assume that the main findings carry over to other consensus clustering approaches as well as other quality and diversity measures, unless they address the key issue of non-uniqueness of the mean and expected partition. We therefore conjecture that improved results obtained by unstable approaches are due to chance and not any systematic factor such as diversity of the sample partitions. To improve quality, we advocate to limit the diversity of the mean partitions and to incorporate prior knowledge.

This contribution extends Condorcet’s Jury Theorem to partition spaces endowed with a metric induced by the Euclidean norm under the following additional assumptions: (i) existence of a hard ground-truth partition, and (ii) all sample partitions and the ground-truth are contained in some asymmetry ball. In light of the proposed theorem and its assumptions, we show that the quality of consensus clustering is independent of the diversity of the sample partitions. To improve quality, we advocate to limit diversity of the mean partitions to keep the estimation error low and incorporate prior knowledge to reduce the approximation error. We regard this finding as helpful for devising better consensus clustering algorithms. Future research aims at extending the proposed theorem by relaxing assumptions (i) and (ii) and at empirically investigating the conjecture on diversity of the mean partitions.

image

To prove Theorem 2, it is helpful to use a suitable representation of partitions. We suggest to represent partitions as points of some geometric space, called orbit space [24]. Orbit spaces are well explored, possess a rich geometrical structure and have a natural connection to Euclidean spaces [3, 23, 38].

A.1. Partition Spaces

image

The group Π = Πℓ of all (ℓ × ℓ)-of all (ℓ × ℓ)-permutation matrices is a discontinuous group that acts on X by matrix multiplication, that is

image

The orbit of  X ∈ Xis the set [X] = {P X : P ∈ Π}. The orbit space of partitions is the quotient space  X /Π = {[X] : X ∈ X }obtained by the action of the permutation group Π on the set X . We write P = X /Π to denote the partition space and  X ∈ Pto denote an orbit [X] ∈ X /Π. The natural projection  π : X → Psends matrices X to the partitions π(X) = [X] they represent. The partition space P is endowed with the intrinsic metric  δdefined by  δ(X, Y ) = min {∥X − Y ∥ : X ∈ X, Y ∈ Y }.

A.2. Dirichlet Fundamental Domains

We use the following notations: ByU we denote the closure of a subset  U ⊆ X , by ∂U theboundary of  U, and by U◦ the open subsetU \ ∂U. The action of permutation  P ∈ Π on thesubset  U ⊆ Xis the set defined by  P U = {P X : X ∈ U}. By Π∗ = Π \ {I}we denote the subset of (ℓ × ℓ)-permutation matrices without identity matrix I.

A subset F of X is a fundamental set for Π if and only if F contains exactly one representation X from each orbit [X] ∈ X /Π. A fundamental domain of Π in X is a closed connected set  F ⊆ Xthat satisfies

image

Lemma 1. Let DZbe a Dirichlet fundamental domain of representation Z of an asymmetric partition  Z ∈ P. Suppose that  X and X′ are two different representations of a partition X such that  X, X′ ∈ DZ. Then X, X′ ∈ ∂DZ.

image

A.3. Multiple Alignments

Let  Sn = (X1, . . . , Xn) be a sample of n partitions  Xi ∈ P. A multiple alignment of  Sn is ann-tuple X = (X1, . . . , Xn) consisting of representations  Xi ∈ Xi. By

image

we denote the set of all multiple alignments of  Sn. A multiple alignment  X = (X1, . . . , Xn)is said to be in optimal position with representation Z of a partition Z, if all representations Xi of Xare in optimal position with Z. The mean of a multiple alignment  X = (X1, . . . , Xn)is denoted by

image

The problem of finding an optimal multiple alignment is that of finding a multiple alignment with smallest average pairwise squared distances in X . To show equivalence between mean partitions and an optimal multiple alignments, we introduce the sets of minimizers of the respective functions  Fn and fn:

image

For a given sample  Sn, the set M(Fn) is the mean partition set and  M(fn) is the set of all optimal multiple alignments. The next result shows that any solution of  Fnis also a solution of  fnand vice versa.

image

A.4. Proof of Theorem 2

image

1 Without loss of generality, we pick a representation  X∗of the ground-truth partition  X∗.Let Z be a representation of Z in optimal position with  X∗. By

image

image

3 From [26], Theorem 3.1 follows that the mean partition  M of Snis unique. We show that M ∈ AZ. Suppose that  X = (X1, . . . , Xn) is a multiple alignment in optimal position with Z. Since φ : AZ → AZis a bijective isometry, we have

image

is a representation of a mean partition  M of Sn. Since AZis convex, we find that  M ∈ AZand therefore  M ∈ AZ.

4 From Part 1–3 of this proof follows that the multiple alignment X is in optimal position with  X∗.We show that there is no other multiple alignment of  Snwith this property. Observe that  AZis contained in the Dirichlet fundamental domain  DZof representation Z. Let SZ = φ(SQ) be a representation of the support in  A◦Z. Then by assumption, we have  SZ ⊆ A◦Z ⊂ DZ showing that  SZ lies in the interior of  DZ. From the definition of a fundamental domain together with Lemma 1 follows that X is the unique optimal alignment in optimal position with  X∗.

5 With the same argumentation as in the previous part of this proof, we find that M is the unique representation of M in optimal position with  X∗.

6 Let z ∈ Zbe a data point. Since  Xi ∈ Xiis the unique representation in optimal position with  X∗, the vote of  Xion data point z is of the form  VXi(z) = VXi(z) for all i ∈ {1, . . . , n}.With the same argument, we have  Vn(z) = VM(z) = VM(z).

7 By x(i)(z) we denote the column of  Xithat represents z. By definition, we have

image

for all  i ∈ {1, . . . , n}. Since Xi and X∗are both hard partitions, we find that �x(i)(z), x∗(z)�= I�x(i)(z) = x∗(z)�,

image

image

Thus, the agreement  kM(z) counts the fraction of sample partitions  Xithat correctly classify

image

denote the probability that the majority of the sample partitions  Xicorrectly classifies z. Since the votes of the sample partitions are assumed to be independent, we can compute  pnusing the binomial distribution

image

where  r = ⌊n/2⌋ + 1 and ⌊a⌋is the largest integer  b with b ≤ a. Then the assertion of Eq. (2) follows from [17], Theorem 1.

9 We show the assertion of Eq. (3). By assumption, the support  SQis contained in an open subset of the asymmetry ball  AZ. From [26], Theorem 3.1 follows that the expected partition MQ of Qis unique. Then the sequence (Mn)n∈Nconverges almost surely to the expected partition  MQaccording to [24], Theorem 3.1 and Theorem 3.3. From the first eight parts of the proof follows that the limit partition  MQagrees on any data point z almost surely with the ground-truth partition  X∗. This shows the assertion.

[1] D. Berend and J. Paroush. When is Condorcet’s Jury Theorem valid? Social Choice and Welfare, 15(4):481–488, 1998. [2] A. Bhattacharya and R. Bhattacharya. Nonparametric Inference on Manifolds with Applications to Shape Spaces. Cambridge University Press, 2012. [3] G. E. Bredon. Introduction to Compact Transformation Groups. Elsevier, 1972. [4] N.C. de Condorcet. Essai sur l’application de l’analyse `a la probabilit´e des d´ecisions rendues `a la pluralit´e des voix. Imprimerie Royale, Paris, 1785. [5] E. Dimitriadou, A. Weingessel, and K. Hornik. A Combination Scheme for Fuzzy Clus- tering. Advances in Soft Computing, 2002.

image

[10] X.Z. Fern and W. Lin. Cluster ensemble selection. Statistical Analysis and Data Mining, 1(3):128–141, 2008. [11] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. Interna-

image

[12] L. Franek and X. Jiang. Ensemble clustering by means of clustering embedding in vector

image

[14] C.E. Ginestet. Strong Consistency of Fr´echet Sample Mean Sets for Graph-Valued Ran- dom Variables. arXiv: 1204.3183, 2012. [15] R. Ghaemi, N. Sulaiman, H. Ibrahim, and N. Mustapha. A Survey: Clustering Ensembles

image

[16] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on

image

[17] B. Grofman, G. Owen, and S.L. Feld. Thirteen theorems in search of the truth. Theory

image

[18] S.T. Hadjitodorov, L.I. Kuncheva, and L.P. Todorova. Moderate diversity for better

image

[19] J. Hu, T. Li, H. Wang, and H. Fujita. Hierarchical cluster ensemble model based on

image

[20] S. Huckemann, T. Hotz, and A. Munk. Intrinsic shape analysis: Geodesic PCA for

image

[21] N. Iam-On, T. Boongoen, S. Garrett, and C. Price. A link-based approach to the cluster

image

[22] B.J. Jain and K. Obermayer. Structure Spaces. Journal of Machine Learning Research,

image

[23] B.J. Jain. Geometry of Graph Edit Distance Spaces. arXiv: 1505.08071, 2015. [24] B.J. Jain. Asymptotic Behavior of Mean Partitions in Consensus Clustering.

image

[27] B.J. Jain. The Mean Partition Theorem of Consensus Clustering. arXiv:1604.06626,

2016. [28] D.G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces.

image

[29] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active learn-

image

[30] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley &

image

[31] L.I. Kuncheva and S.T. Hadjitodorov. Using diversity in cluster ensembles. IEEE International Conference on Systems, Man and Cybernetics, 2:1214–1219, 2004.

[32] L. Lam and C.Y. Suen. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans, 27(5):553-568, 1997.

[33] T. Li, C. Ding and M.I. Jordan. Solving consensus and semi-supervised clustering prob-

image

[34] T. Li and C. Ding. Weighted consensus clustering. SIAM International Conference on

image

[35] J.S. Marron and A.M. Alonso. Overview of object oriented data analysis. Biometrical Journal, 56(5):732–753, 2014.

[36] M. Pividori, G. Stegmayer, and D. Milone. Diversity control for improving the analysis of consensus clustering. Information Sciences, 361–362:120–134, 2016.

image

[39] L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010. [40] A. Strehl and J. Ghosh. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002.

image

[42] A.P. Topchy, A.K. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions in Pattern Analysis and Machine Intelligence, 27(12):1866–1881, 2005.

[43] S. Vega-Pons, J. Correa-Morris and J. Ruiz-Shulcloper. Weighted partition consensus via kernels. Pattern Recognition, 43(8):2712–2724, 2010.

[44] S. Vega-Pons and J. Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337-372, 2011.

[45] J. Waldron. The Wisdom of the Multitude: Some Reflections on Book III Chapter 11 of the Politics. Political Theory 23:563–84, 1995.

[46] H. Wang and J.S. Marron. Object oriented data analysis: sets of trees. The Annals of Statistics 35:1849–1873, 2007.

[47] F. Yang, X. Li, Q. Li, and T. Li. Exploring the diversity in cluster ensemble gener- ation: Random sampling and random projection. Expert Systems with Applications, 41(10):4844–4866, 2014.

[48] Z. Zhou. Ensemble Methods: Foundations and Algorithms. Taylor & Francis Group, LLC, 2012.


Designed for Accessibility and to further Open Science