Condorcet's Jury Theorem for Consensus Clustering and its Implications for Diversity

2016·Arxiv

Abstract

Abstract

Condorcet’s Jury Theorem has been invoked for ensemble classifiers to indicate that the combination of many classifiers can have better predictive performance than a single classifier. Such a theoretical underpinning is unknown for consensus clustering. This article extends Condorcet’s Jury Theorem to the mean partition approach under the additional assumptions that a unique ground-truth partition exists and sample partitions are drawn from a sufficiently small ball containing the ground-truth. As an implication of practical relevance, we question the claim that the quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partitions is necessary for controlling the quality.

1. Introduction

Ensemble learning generates multiple models and combines them to a single consensus model to solve a learning problem. The assumption is that a consensus model performs better than an individual model or at least reduces the likelihood of selecting a model with inferior performance [37]. Examples of ensemble learning are classifier ensembles [6, 30, 39, 48] and cluster ensembles (consensus clustering) [15, 40, 44, 47].

The assumptions on ensemble learning follow the idea of collective wisdom that many heads are in general better than one. The idea of group intelligence applied to societies can be traced back to Aristotle and the philosophers of antiquity (see [45]) and has been recently revived by a number of publications, including James Surowiecki’s book The Wisdom of Crowds [41]. In his book, Surowiecki argues that not all crowds are wise, but in order to become wise, the crowd should comply to diversity of opinion and other criteria.

One theoretical basis for collective wisdom can be derived from Condorcet’s Jury Theorem [4]. The theorem refers to a jury of n voters that need to reach a decision by majority vote. The assumptions of the simplest version of the theorem are: (1) There are two alternatives; (2) one of both alternatives is correct; (3) voters decide independently; and (4) the probability p of a correct decision is identical for every voter. If the voters are competent, that is p > 0.5, then Condorcet’s Jury Theorem states that the probability of a correct decision by majority vote tends to one as the number n of voters increases to infinity.

Condorcet’s Jury Theorem has been generalized in several ways, because its assumptions are considered as rather restrictive and partly unrealistic (see e.g. [1] and references therein). Despite its practical limitations, the theorem has been used to indicate a theoretical justification of ensemble classifiers [30, 32, 39]. In contrast to ensemble classifiers, such a theoretical underpinning is unknown for consensus clustering.

Though the importance of diversity for classifier ensembles has been recognized long before Surowiecki’s book [29, 30], some authors and scholars additionally employ Surowiecki’s argument on diversity in retrospect for ensemble learning [39]. Inspired by the success of ensemble classifiers, diversity has also been promoted as a key factor for improving the quality of consensus clustering [18, 10, 19, 21, 31, 34, 36, 40, 44, 47]. However, a sound and consistent relationship between diversity and quality of consensus clustering could not yet be established. Instead, different empirical studies on the role of diversity in consensus clustering led to contradictory results [36].

This article addresses both problems: (1) the lack of a theoretical underpinning of consensus clustering with regard to the paradigm of collective wisdom; and (2) the lack of a consistent relationship between diversity and quality.

With regard to the first problem, we extend Condorcet’s Jury Theorem to the mean partition approach [5, 7, 11, 12, 16, 33, 40, 42, 43]. Here, we consider the special case that the partition space is endowed with a metric induced by the Euclidean norm. Then the proposed theorem draws on the following assumptions: (1) there is a unique (possibly unknown) ground-truth partition ; and (2) sample partitions are drawn i.i.d. from a sufficiently small ball containing .

With regard to the second problem, this article analyzes the role of diversity in light of Condorcet’s Jury Theorem within a special problem setting. We question the claim that the quality of consensus clustering depends on the diversity of the sample partitions generated by the cluster ensemble. Instead, we conjecture that limiting the diversity of the mean partitions is necessary to control the quality. This finding would explain the contradictory results of previous empirical studies on diversity as outlined and discussed in [36].

The rest of this paper is structured as follows: Section 2 introduces background material. In Section 3 we present the extended version of Condorcet’s Jury Theorem. Section 4 discusses the role of diversity and Section 5 concludes. Proofs are delegated to the appendix.

2. Background and Related Work

2.1. The Mean Partition Approach

The goal is to group a set of m data points into clusters. The mean partition approach first clusters the same data set Z several times using differ-ent settings and strategies of the same or different cluster algorithms. The resulting clusterings form a sample = () of n partitions of data set Z. The mean partition approach aims at finding a consensus clustering that minimizes a sum-of-distances criterion from the sample partitions. In Section 3, we specify the underlying partition space and in Section 3.3 we present a formal definition of the mean partition approach.

2.2. Context of the Mean Partition Approach

We place the mean partition approach into the broader context of mathematical statistics. The motivation is that mathematical statistics offers a plethora of useful results, the consensus clustering literature seems to be unaware of. For example, the proof of Condorcet’s Jury Theorem rests on results from statistical analysis of graphs [25]. These results in turn are rooted on Fr´echet’s seminal monograph [13] and its follow-up research.

Since a meaningful addition of partitions is unknown, the mean partition approach emulates an averaging procedure by minimizing a sum-of-distances criterion. This idea is not new and has been studied in more general form for almost seven decades. In 1948, Fr´echet first generalized the idea of averaging in metric spaces, where a well-defined addition is unknown. He showed that specification of a metric and a probability distribution is sufficient to define a mean element as measure of central tendency. The mean of a sample of elements is any element that minimizes the sum of squared distances from all sample elements. Similarly, the expectation of a probability distribution minimizes an integral of the sum of squared distances from all elements of the entire space.

Since Fr´echet’s seminal work, mathematical statistics studied asymptotic and other properties of the mean element in abstract metric spaces. Examples include statistical analysis of shapes [2, 8, 20, 28], complex objects [35, 46], tree-structured data [9, 46], and graphs [14, 25].

The partition spaces defined in Section 3.2 can be regarded as a special case of graph spaces [22, 23]. Consequently, the geometric as well as statistical properties of graph spaces carry over to partition spaces. The proof of the proposed theorem rests on the orbit space framework [22, 23], on the mean partition theorem in graph spaces, and on asymptotic properties of the sample mean of graphs [25] that have been adopted to partition spaces [24, 27].

3. Fr´echet Functions on Partition Spaces

This section first introduces partition spaces endowed with a metric induced by the Euclidean norm. Then we formalize the mean partition approach using Fr´echet functions.

Throughout this contribution, we assume that is a set of m data points to be clustered and is a set of cluster labels.

3.1. Partitions and their Representations

Partitions usually occur in two forms, in a labeled and in an unlabeled form, where labeled partitions can be regarded as representations of unlabeled partitions.

We begin with describing labeled partitions. Let denote the vector of all ones. Consider the set

of matrices with elements from the unit interval and whose columns sum to one. A matrix represents a labeled (soft) partition of Z. The elements of X = () describe the degree of membership of data point to the cluster with label . The columns of X summarize the membership values of the data points across all clusters. The rows of X represent the clusters .

Next, we describe unlabeled partitions. Observe that the rows of a labeled partition X describe a cluster structure. Permuting the rows of X results in a labeled partition with the same cluster structure but with a possibly different labeling of the clusters. In cluster analysis, the particular labeling of the clusters is usually meaningless. What matters is the abstract cluster structure represented by a labeled partition. Since there is no natural labeling of the clusters, we define the corresponding unlabeled partition as the equivalence class of all labeled partitions that can be obtained from one another by relabeling the clusters. Formally, an unlabeled partition is a set of the form

where Πis the set of all ()-permutation matrices.

In the following, we briefly call X a partition instead of unlabeled partition. In addition, any labeled partition is called a representation of partition X. By P we denote the set of all (unlabeled) partitions with clusters over m data points. Since some clusters may be empty, the set P also contains partitions with less than clusters. Thus, we consider as the maximum number of clusters we encounter.

A is a partition whose matrix representations take only binary membership values from {0, 1}. By we denote the subset of all hard partitions. Note that the columns of representations of hard partitions are standard basis vectors from .

Though we are only interested in unlabeled partitions, we still need labeled partitions for two reasons: (1) computers can not easily and efficiently cope with unlabeled partitions unless the clusters carry labels in terms of number or names; and (2) using labeled partitions considerably simplifies derivation of theoretical results.

3.2. Intrinsic Metric

We endow the set P of partitions with an intrinsic metric induced by the Euclidean norm such that () becomes a geodesic space. The Euclidean norm for matrices is defined by

The norm is also known as the Frobenius or Schur norm. We call Euclidean norm in order to emphasize the geometric properties of the partition space. The Euclidean norm induces the distance function

for all partitions . Then the pair () is a geodesic metric space [24], Theorem 2.1. Suppose that X and Y are two partitions. Then

for all representations and . For some pairs of representations and equality holds in Eq. (1). In this case, we say that representations and are in optimal position. Note that pairs of representations in optimal position are not uniquely determined.

3.3. Fr´echet Functions

We first formalize the mean partition approach using Fr´echet functions. Then we present the Mean Partition Theorem, which is of pivotal importance for gaining deeper insight into the theory of the mean partition approach [27]. Here, we apply the Mean Partition Theorem to define the concept of majority vote. In addition, the proof of the proposed theorem resorts to the properties stated in the Mean Partition Theorem.

Let () be a partition space endowed with the metric induced by the Euclidean norm. We assume that Q is a probability dsitribution on P with support Suppose that = () is a sample of n partitions drawn i.i.d. from the

probability distribution Q. Then the Fr´echet function of is of the form

A mean partition of sample is any partition satisfying

Note that a mean partition needs not to be a member of the support. In addition, a mean partition exists but is not unique, in general [24].

The Mean Partition Theorem proved in [27] states that any representation M of a local minimum M of is the standard mean of sample representations in optimal position with M.

Theorem 1. Let = (be a sample of n partitions. Suppose that is a local minimum of the Fr´echet function ) of . Then every representation M of M is of the form

Condorcet’s original theorem is an asymptotical statement about the majority vote. To adopt this statement, we introduce the notion of expected partition. An expected partition of probability distribution Q is any partition that minimizes the expected Fr´echet function

As for the sample Fr´echet function , the minimum of the expected Fr´echet function exists but but is not unique, in general [24].

4. Condorcet’s Jury Theorem

This section extends Condorcet’s Jury Theorem to the partition space defined in Section 3.2.

4.1. The General Setting

Theorem 2 extends Condorcet’s Jury Theorem for hard partitions. Generalization to arbitrary partitions is out of scope and left for future research.

The general setting of Theorem 2 is as follows: Let = () be a sample of n hard partitions drawn i.i.d. from a probability distribution Q. Each of the sample partitions has a vote on a given data point with probability ) of being correct. The goal is to reach a final decision on data point z by majority vote. Theorem 2 makes an asymptotic statement about the correctness of the majority vote given the probabilities .

To formulate Theorem 2, we need to define the concepts of vote and majority vote. The majority vote is based on the mean partition of a sample and is not necessarily a hard partition. Since the mean partition itself votes, we introduce votes for arbitrary (soft and hard) partitions and later restrict ourselves to samples of hard partitions when defining the majority vote.

Assumption. In the following, we assume existence of a possibly unknown but unique hard ground-truth partition . By we denote an arbitrarily selected but fixed representation of .

4.2. Votes

We model the vote of a partition on a given data point . The vote of X on z has two possible outcomes: The vote is correct if X agrees on z with the ground-truth , and the vote is wrong otherwise. To model the vote of a partition, we need to specify what we mean by agreeing on a data-point with the ground-truth. An agreement function of representation X of X is a function of the form

where and are the j-th columns of the representations X and , respectively. A column of a matrix represents the membership values of the corresponding data point across all clusters. Then the value ) measures how strongly representation X agrees with the ground-truth on data point . If X is a hard partition, then ) = 1 if z occurs in the same cluster of X and , and ) = 0 otherwise.

The vote of representation X of partition X on data point z is defined by

where I {b} is the indicator function that gives 1 if the boolean expression b is true, and 0 otherwise. Observe that for hard partitions .

Based on the vote of a representation we can define the vote of a partition. The vote of partition is a Bernoulli distributed random variable. We randomly select a representation X of partition X in optimal position with . Then the vote ) of X on data point z is ). By

we denote the probability of a correct vote of partition X on data point z. Note that the probability ) is independent of the particular choice of representation of the ground-truth partition .

4.3. Majority Vote

We assume that = () is a sample of n hard partitions drawn i.i.d. from a cluster ensemble. We define a majority vote ) of sample on z as follows: We randomly select a mean partition M of and then set the majority vote ) on z to the vote ) of the chosen mean partition

It remains to show that the vote ) of any mean partition M of is indeed a majority vote. To see this, we invoke the Mean Partition Theorem. Any representation M of mean partition M is of the form

where are representations in optimal position with M. For a given data point , the mean membership values are given by

where denotes the j-th column of representation . Since the columns of are standard basis vectors, the elements of the j-th column contain the relative frequencies with which data point occurs in cluster . Then the vote ) is correct if and only if the agreement function of M satisfies

This in turn implies that there is a majority 5 for some cluster , because is a hard partition by assumption.

4.4. Condorcet’s Jury Theorem

Roughly, Condorcet’s Jury Theorem states that the majority vote tends to be correct when the individual voters are independent and competent. In consensus clustering, the majority vote is based on mean partitions. Individual sample partitions are competent on data point if the probability of a correct vote on z is given by 5. In the spirit of Condorcet’s Jury Theorem, we want to show that the probability ) = 1) of the majority vote ) tends to one with increasing sample size n.

In general, mean partitions are neither unique nor converge to a unique expected partition. This in turn may result in a non-convergent sequence (of majority votes for a given data points z. In this case, it is not possible to establish convergence in probability to the ground-truth. To cope with this problem, we demand that the sample partitions are all contained in a sufficiently small ball, called asymmetry ball. The of partition is the subset of the form

where is the degree of asymmetry of Z defined by

A partition Z is asymmetric if 0. If = 0 the partition Z is called symmetric. Any partition whose representations have mutually distinct rows is an asymmetric partition. Conversely, a partition is symmetric if it has a representation with at least two identical rows. We refer to [26] for more details on asymmetric partitions.

By we denote the largest open subset of . If Z is symmetric, then = be definition. Thus, a non-empty set entails that Z is symmetric.

A probability distribution Q is homogeneous if there is a partition Z such that the support of probability distribution Q is contained in the asymmetry ball . A sample is said to be homogeneous if the sample partitions of are drawn from a homogeneous distribution Q.

Now we are in the position to present Condorcet’s Jury Theorem for the mean partition approach. For a proof we refer to the appendix.

Theorem 2 (Condorcet’s Jury Theorem). Let Q be a probability measure on with support . Suppose the following assumptions hold:

1. There is a partition such that and .

2. Hard partitions are drawn i.i.d. according to Q.

3. Let . Then ) is constant for all .

Equation (2) corresponds to Condorcet’s original theorem for majority vote on a single data point and Eq. (3) shows that the sequence of mean partitions converges almost surely to the ground-truth partition. Observe that almost sure convergence in Eq. (3) also holds when the probabilities differ for different data points . From the proof of Condorcet’s Jury Theorem follows that the ground-truth partition is an expected partition almost surely and therefore takes the form as described in the Expected Partition Theorem [27].

5. The Role of Diversity

This section discusses the role of diversity under the assumptions of a special setting. We question the prevailing claim that quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partition set is necessary to control the quality.

5.1. Diversity and Quality

This section specifies the notion of diversity and quality for the subsequent analysis.

Suppose that S is either a sample or a set consisting of n elements . There are two common approaches to measure the diversity of S. Both approaches assume a dissimilarity function ∆ : . Typical choices for ∆ are based on the adjusted Rand index, the Jaccard index, and the Normalized Mutual Information. The first approach averages the pairwise dissimilarities

and the second approach is the variation

where M is either a medoid or mean partition of S. A simple calculation shows that both diversity measures are closely related by the inequality

For the following analysis, we assume that ∆ = . In this case, the diversity of a homogeneous sample (set) S is bounded by

where is the degree of asymmetry of a partition Z whose asymmetry ball includes S. We say, a sample (set) is diverse if it is not homogeneous.

Let be the ground-truth partition to be predicted as good as possible. For this, we need to specify the quality of a prediction, that is what we exactly mean by the term ”as good as possible”. We introduce the loss L(X) = ) of a partition X as a

measure of how well X predicts the ground-truth .

To measure the quality of consensus clustering, we assume that is a sample

is the worst-case loss of S. Consensus clustering generates a sample of sample partitions. The sample determines the non-empty and finite set of mean partitions. A mean algorithm picks a mean partition M from to predict the ground-truth . We define the quality of consensus clustering by the worst-case loss ). We consider the worst-case loss rather than the average or expected loss to keep the subsequent analysis simple.

5.2. The Role of Diversity

Diversity of the sample partitions is a basic prerequisite for consensus clustering. Without diversity, consensus clustering reproduces the same partition. In this case, the resulting mean partition is merely another replication.

Research on diversity in consensus clustering assumes that the quality of consensus clustering depends on the diversity of the sample. However, there is a significant gap in argument between the fact that diversity is a basic prerequisite to make sense of consensus clustering and the assumption that quality depends on diversity. The hope is that this gap can be bridged by an appropriate but currently unknown diversity measure. The proof of Theorem 2 rejects this hope.

Theorem 2 includes the assumptions of Condorcet’s original theorem and adds two further assumptions: (1) existence of a unique ground-truth partition, and (2) existence of a sufficiently small ball that contains the ground-truth partition as well as any sample partition.

The first assumption transfers the notion of correct decision in Condorcet’s original formulation to the context of consensus clustering. The second assumption demands homogeneity of the probability measure. At first glance, homogeneity seems to contradict the assumption that diversity of the sample is a decisive factor for improving the quality of consensus clustering.

A closer look at the proof of Theorem 2 shows that neither homogeneity nor diversity of the sample matters. What matters is that the mean partition is a consistent estimator of the ground-truth. Homogeneous probability distributions merely form one among several other classes of distributions for which consistency of the mean partition to a unique expected partition is ensured. Diversity of the sample can entail both, the desired consistency properties as well as diversity of the mean and expected partitions. In the latter case, Theorem 2 fails and the mean partition returned by the consensus clustering method may have nothing in common with the ground-truth.

The claim is that diversity of the sample partitions is neither necessary nor sufficient for finding mean partitions close to the ground-truth, whereas limiting the diversity of the mean partition set is necessary but not sufficient to control the quality of consensus clustering.

5.3. Loss Decomposition

Inspired by the bias-variance decomposition in supervised learning, we discuss the different sources of error in consensus clustering to understand how we can improve its quality. For this, we call a cluster ensemble stable (unstable) if its mean partition set is homogeneous (diverse). Suppose that is a mean partition set. Then

is the approximation error of F. The approximation error tells us how close the best solution we can attain is to the ground-truth. Then the worst-case loss ) of a mean partition set can be expressed as

The approximation error is caused by the choice and design of the underlying clustering ensemble method. The estimation error is caused by the random process of generating partitions. This error measures the difference between the loss in the worst and best case. If the set of mean partitions is homogeneous (diverse), then the estimation error is small (large).

For homogeneous distributions, the set is a singleton. Therefore, the estimation error is always zero, but the approximation error is likely to be large, when prior knowledge about the data is not considered. More generally, stable approaches guarantee a small estimation error and thereby impose a strong bias about the kind of cluster structure the underlying ensemble method is looking for. If the assumption on the cluster structure does not match the ground-truth, the approximation error will be large. The usual way to improve stable approaches is to incorporate competence by means of prior knowledge and domain expertise in order to hopefully reduce the approximation error. Under the special assumptions of Theorem 2, the estimation error is zero and the worst-case loss ) converges almost surely to zero as the sample size n tends to infinity.

In contrast, unstable approaches result in large estimation errors, but may more likely have a lower approximation error than stable approaches. Such approaches are less biased about the kind of cluster structure the underlying ensemble method is looking for. Unstable approaches shift the problem of assuming a certain kind of cluster structure in the data to the problem of identifying a mean partition close to the ground-truth. In other words, the bias on the kind of cluster structure is shifted to a bias on the choice of mean partition. The problem is that existing mean algorithms are unable to cope with the latter bias, because the Fr´echet function contains no information about the ground-truth. Consequently, a mean algorithm regards the diverse sample means as equivalent solutions of the same minimization problem. Thus, obtaining a mean partition with low loss is merely due to chance.

Another issue is that unstable approaches are prone to the Texas sharpshooter fallacy: Suppose a data set Z is given for which the ground truth is known. Repeatedly apply a consensus clustering method to Z. In each trial, consensus clustering generates sample partitions and returns a mean partition. The mean partitions of the different trials can differ substantially. In this case, select the top N mean partitions closest to the known ground-truth and claim that diversity improves quality.

Finally, it is unclear how to improve the quality of unstable approaches without turning them into stable versions. A possible solution could be to derive the mean partitions of the second order, that is the mean partitions of the mean partition set. If the mean partition set is diverse, it can happen that mean partition sets of higher order remain diverse. Empirical results could shed light on this issue but require enumerating a subset of mean partitions, which is computationally expensive. Also incorporating prior knowledge will not improve the quality of unstable approaches, because the mean partitions are diverse.

Though we analyzed the relationship between diversity and quality in a special setting, we assume that the main findings carry over to other consensus clustering approaches as well as other quality and diversity measures, unless they address the key issue of non-uniqueness of the mean and expected partition. We therefore conjecture that improved results obtained by unstable approaches are due to chance and not any systematic factor such as diversity of the sample partitions. To improve quality, we advocate to limit the diversity of the mean partitions and to incorporate prior knowledge.

6. Conclusion

This contribution extends Condorcet’s Jury Theorem to partition spaces endowed with a metric induced by the Euclidean norm under the following additional assumptions: (i) existence of a hard ground-truth partition, and (ii) all sample partitions and the ground-truth are contained in some asymmetry ball. In light of the proposed theorem and its assumptions, we show that the quality of consensus clustering is independent of the diversity of the sample partitions. To improve quality, we advocate to limit diversity of the mean partitions to keep the estimation error low and incorporate prior knowledge to reduce the approximation error. We regard this finding as helpful for devising better consensus clustering algorithms. Future research aims at extending the proposed theorem by relaxing assumptions (i) and (ii) and at empirically investigating the conjecture on diversity of the mean partitions.

A. Proof of Theorem 2

To prove Theorem 2, it is helpful to use a suitable representation of partitions. We suggest to represent partitions as points of some geometric space, called orbit space [24]. Orbit spaces are well explored, possess a rich geometrical structure and have a natural connection to Euclidean spaces [3, 23, 38].

A.1. Partition Spaces

The group Π = Π)-permutation matrices is a discontinuous group that acts on X by matrix multiplication, that is

The orbit of is the set [. The orbit space of partitions is the quotient space obtained by the action of the permutation group Π on the set X . We write P = X /Π to denote the partition space and to denote an orbit [Π. The natural projection sends matrices X to the partitions ] they represent. The partition space P is endowed with the intrinsic metric defined by

A.2. Dirichlet Fundamental Domains

We use the following notations: ByU we denote the closure of a subset boundary of the open subset. The action of permutation subset is the set defined by we denote the subset of ()-permutation matrices without identity matrix I.

A subset F of X is a fundamental set for Π if and only if F contains exactly one representation X from each orbit [Π. A fundamental domain of Π in X is a closed connected set that satisfies

be a Dirichlet fundamental domain of representation Z of an asymmetric partition . Suppose that are two different representations of a partition X such that

A.3. Multiple Alignments

Let ) be a sample of n partitions . A multiple alignment of ) consisting of representations

we denote the set of all multiple alignments of . A multiple alignment is said to be in optimal position with representation Z of a partition Z, if all representations are in optimal position with Z. The mean of a multiple alignment is denoted by

The problem of finding an optimal multiple alignment is that of finding a multiple alignment with smallest average pairwise squared distances in X . To show equivalence between mean partitions and an optimal multiple alignments, we introduce the sets of minimizers of the respective functions

For a given sample ) is the mean partition set and ) is the set of all optimal multiple alignments. The next result shows that any solution of is also a solution of and vice versa.

A.4. Proof of Theorem 2

1 Without loss of generality, we pick a representation of the ground-truth partition Let Z be a representation of Z in optimal position with

3 From [26], Theorem 3.1 follows that the mean partition is unique. We show that . Suppose that ) is a multiple alignment in optimal position with is a bijective isometry, we have

is a representation of a mean partition is convex, we find that and therefore

4 From Part 1–3 of this proof follows that the multiple alignment X is in optimal position with We show that there is no other multiple alignment of with this property. Observe that is contained in the Dirichlet fundamental domain of representation ) be a representation of the support in . Then by assumption, we have showing that lies in the interior of . From the definition of a fundamental domain together with Lemma 1 follows that X is the unique optimal alignment in optimal position with

5 With the same argumentation as in the previous part of this proof, we find that M is the unique representation of M in optimal position with

be a data point. Since is the unique representation in optimal position with , the vote of on data point z is of the form With the same argument, we have

) we denote the column of that represents z. By definition, we have

for all are both hard partitions, we find that

Thus, the agreement ) counts the fraction of sample partitions that correctly classify

denote the probability that the majority of the sample partitions correctly classifies z. Since the votes of the sample partitions are assumed to be independent, we can compute using the binomial distribution

where is the largest integer . Then the assertion of Eq. (2) follows from [17], Theorem 1.

9 We show the assertion of Eq. (3). By assumption, the support is contained in an open subset of the asymmetry ball . From [26], Theorem 3.1 follows that the expected partition is unique. Then the sequence (converges almost surely to the expected partition according to [24], Theorem 3.1 and Theorem 3.3. From the first eight parts of the proof follows that the limit partition agrees on any data point z almost surely with the ground-truth partition . This shows the assertion.

References References

[1] D. Berend and J. Paroush. When is Condorcet’s Jury Theorem valid? Social Choice and Welfare, 15(4):481–488, 1998. [2] A. Bhattacharya and R. Bhattacharya. Nonparametric Inference on Manifolds with Applications to Shape Spaces. Cambridge University Press, 2012. [3] G. E. Bredon. Introduction to Compact Transformation Groups. Elsevier, 1972. [4] N.C. de Condorcet. Essai sur l’application de l’analyse `a la probabilit´e des d´ecisions rendues `a la pluralit´e des voix. Imprimerie Royale, Paris, 1785. [5] E. Dimitriadou, A. Weingessel, and K. Hornik. A Combination Scheme for Fuzzy Clus- tering. Advances in Soft Computing, 2002.

[10] X.Z. Fern and W. Lin. Cluster ensemble selection. Statistical Analysis and Data Mining, 1(3):128–141, 2008. [11] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. Interna-

[12] L. Franek and X. Jiang. Ensemble clustering by means of clustering embedding in vector

[14] C.E. Ginestet. Strong Consistency of Fr´echet Sample Mean Sets for Graph-Valued Ran- dom Variables. arXiv: 1204.3183, 2012. [15] R. Ghaemi, N. Sulaiman, H. Ibrahim, and N. Mustapha. A Survey: Clustering Ensembles

[16] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on

[17] B. Grofman, G. Owen, and S.L. Feld. Thirteen theorems in search of the truth. Theory

[18] S.T. Hadjitodorov, L.I. Kuncheva, and L.P. Todorova. Moderate diversity for better

[19] J. Hu, T. Li, H. Wang, and H. Fujita. Hierarchical cluster ensemble model based on

[20] S. Huckemann, T. Hotz, and A. Munk. Intrinsic shape analysis: Geodesic PCA for

[21] N. Iam-On, T. Boongoen, S. Garrett, and C. Price. A link-based approach to the cluster

[22] B.J. Jain and K. Obermayer. Structure Spaces. Journal of Machine Learning Research,

[23] B.J. Jain. Geometry of Graph Edit Distance Spaces. arXiv: 1505.08071, 2015. [24] B.J. Jain. Asymptotic Behavior of Mean Partitions in Consensus Clustering.

[27] B.J. Jain. The Mean Partition Theorem of Consensus Clustering. arXiv:1604.06626,

2016. [28] D.G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces.

[29] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active learn-

[30] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley &

[31] L.I. Kuncheva and S.T. Hadjitodorov. Using diversity in cluster ensembles. IEEE International Conference on Systems, Man and Cybernetics, 2:1214–1219, 2004.

[32] L. Lam and C.Y. Suen. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans, 27(5):553-568, 1997.

[33] T. Li, C. Ding and M.I. Jordan. Solving consensus and semi-supervised clustering prob-

[34] T. Li and C. Ding. Weighted consensus clustering. SIAM International Conference on

[35] J.S. Marron and A.M. Alonso. Overview of object oriented data analysis. Biometrical Journal, 56(5):732–753, 2014.

[36] M. Pividori, G. Stegmayer, and D. Milone. Diversity control for improving the analysis of consensus clustering. Information Sciences, 361–362:120–134, 2016.

[39] L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010. [40] A. Strehl and J. Ghosh. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002.

[42] A.P. Topchy, A.K. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions in Pattern Analysis and Machine Intelligence, 27(12):1866–1881, 2005.

[43] S. Vega-Pons, J. Correa-Morris and J. Ruiz-Shulcloper. Weighted partition consensus via kernels. Pattern Recognition, 43(8):2712–2724, 2010.

[44] S. Vega-Pons and J. Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337-372, 2011.

[45] J. Waldron. The Wisdom of the Multitude: Some Reflections on Book III Chapter 11 of the Politics. Political Theory 23:563–84, 1995.

[46] H. Wang and J.S. Marron. Object oriented data analysis: sets of trees. The Annals of Statistics 35:1849–1873, 2007.

[47] F. Yang, X. Li, Q. Li, and T. Li. Exploring the diversity in cluster ensemble gener- ation: Random sampling and random projection. Expert Systems with Applications, 41(10):4844–4866, 2014.

[48] Z. Zhou. Ensemble Methods: Foundations and Algorithms. Taylor & Francis Group, LLC, 2012.

designed for accessibility and to further open science