- 2 (equal)
- 3 (equal)
- 2 (10/90%)
- 2 (equal)
- 3 (equal)
- 4 (equal)
- 2 (equal)
- 3 (equal)
- 4 (equal)
Cluster centroid separation
Using the ground truth cluster membership, we computed the distance Δ between simulated
datasets’ cluster centroids (after projection into two dimensions) as the Euclidean distance between
the average positions of observations within each cluster in reduced space. For three-cluster
datasets, we computed the distance between the “middle” cluster and another cluster’s centroid to obtain the smallest between-cluster distance.
proportion of different features after dimensionality reduction through MDS (Figure 3) and UMAP
(Figure 4). The visualised datasets also differed in number of clusters (two unequally sized, two
equally sized, or three equally sized), and covariance structure (no covariance, random covariance, or different covariances between clusters), adding up to 180 datasets per figure.
negligible for MDS. Overall, cluster separation increases as a function of higher differences
between clusters within each feature, and the proportion of different features. The same is true for UMAP, although its outcomes are non-linear and more variable.
increasing within-feature effect size and proportion of different features, whereas UMAP shows
improved separation only at large differences (Cohen’s d = 2.1 within each feature) or at large
proportions of different features. Crucially, clusters with different covariance structures (3-factor
and 4 factor; of 3-factor, 4-factor, and random) but similar mean vectors do not show clear
separation.
Figure 3 – Each cell presents the cluster centroid separation Δ (brighter colours indicate stronger
separation) after multi-dimensional scaling (MDS) was applied to simulated data of 1000
observations and 15 features. Separation is shown as a function of within-feature effect size
(Cohen’s d, x-axis), and the proportion of features that were different between clusters. Each row
shows a different covariance structure: “mixed” indicates subgroups with different covariance
structures, “random” with the same random covariance structure between all groups, and “no” for no correlation between any of the features). Each column shows a different type of population: with
unequal (10 and 90%) subgroups, with two equally sized subgroups, and with three equally sized
subgroups.
Figure 4 – Each cell presents the cluster centroid separation Δ (brighter colours indicate stronger separation) after uniform manifold approximation and projection (UMAP) was applied to simulated
data of 1000 observations and 15 features. Separation is shown as a function of within-feature
effect size (Cohen’s d, x-axis), and the proportion of features that were different between clusters.
Each row shows a different covariance structure: “mixed” indicates subgroups with different
covariance structures, “random” with the same random covariance structure between all groups, and “no” for no correlation between any of the features). Each column shows a different type of
population: with unequal (10 and 90%) subgroups, with two equally sized subgroups, and with
three equally sized subgroups.
Membership classification (adjusted Rand index)
Overlap between clustering outcomes and ground truth was computed with the adjusted Rand index. Figure 5 shows the effects of within-feature effect size, number of different features, dimensionality
reduction algorithm, and cluster algorithm on the adjusted Rand index for simulated datasets with
two equally sized subgroups with different covariance structures (3-factor and 4-factor). These
simulated datasets were the optimal example due to their differentiation into two equally sized
groups, and their realistic difference in covariance structure.
regardless of dimensionality reduction. The same is true for the two versions of agglomerative
clustering (Ward linkage with Euclidean distance, or average linkage with cosine distance). By
contrast, HDBSCAN performs well only after UMAP dimensionality reduction. It is likely that this
is due to the algorithm only assigning the denser centres to their respective clusters, while leaving
many other observations unassigned.
Figure 5 – Each cell shows the adjusted Rand index (brighter colours indicate better classification) as a function of within-feature effect size (Cohen’s d), and the proportion of features that differed between two simulated clusters with different covariance structures (3-factor and 4-factor). Each
row presents a different dimensionality reduction approach: None, multi-dimensional scaling
(MDS), or uniform manifold approximation and projection (UMAP). Each column presents a
different type of clustering algorithm: k-means, agglomerative (hierarchical) clustering with Ward
linkage and Euclidean distance, agglomerative clustering with average linkage and cosine
distance, and HDBSCAN.
Data subgrouping (silhouette coefficient)
Silhouette scores reflect the information that would have been available to a researcher to decide on whether discrete subgroups were present in the simulated datasets (i.e. without knowing the ground
truth). Figure 6 shows the effects of within-feature effect size, number of different features,
dimensionality reduction algorithm, and cluster algorithm on the silhouette scores for simulated
datasets two equally sized subgroups with different covariance structures (3-factor and 4-factor).
Unlike the adjusted Rand index, for which the ground truth needs to be known, silhouette scores are
impacted by dimensionality reduction. Using the traditional threshold of 0.5, none of the raw
datasets would have been correctly identified as clustered.
feature difference with a Cohen’s d of 2.1 would have been correctly identified as showing
clustering through k-means and the two agglomerative clustering approaches. On the basis of
HDBSCAN, a researcher would have correctly identified clustering in datasets with two-thirds or
more features showing a difference with a Cohen’s d of 1.3, or one-third or more features showing a difference with a Cohen’s d of 2.1.
would correctly identify the clustered nature of datasets with two-thirds of features showing
differences of 0.8, or one-third or more showing differences of 1.3 or over. The same is true for
HDBSCAN, which was also able to identify clustering when all features showed a within-feature
difference corresponding to Cohen’s d values of 0.3.
Figure 6 – Each cell shows the silhouette coefficient (brighter colours indicate stronger detected clustering, with a threshold set at 0.5) as a function of within-feature effect size (Cohen’s d), and the
proportion of features that differed between two simulated clusters with different covariance
structures (3-factor and 4-factor). Each row presents a different dimensionality reduction
approach: None, multi-dimensional scaling (MDS), or uniform manifold approximation and
projection (UMAP). Each column presents a different type of clustering algorithm: k-means,
agglomerative (hierarchical) clustering with Ward linkage and Euclidean distance, agglomerative clustering with average linkage and cosine distance, and HDBSCAN.
regardless of method. Dimensionality reduction, particularly the non-linear projection provided by UMAP, helped elevate cluster coefficients across the board. Only datasets with large within-feature effect sizes and a high number of features showing differences were correctly marked as “clustered” using the traditional silhouette score threshold of 0.5.
Effect of dimensionality reduction on cluster separation
As expected, simulated sample centroid distances (Figure 7, in green) were aligned with subgroup
centroids before dimensionality reduction was applied, with minor random sampling error.
Dimensionality reduction did impact cluster separation. MDS subtly exaggerated centroid distances across all centroid separations in original space (Figure 7, in blue). UMAP reduced sample centroid distance at lower (Δ < 3) and increased it at higher (Δ > 4) centroid distances (Figure 7, in purple).
The only exception to this was the simulated dataset with three clusters of respectively 3-factor, 4-
factor, and no covariance structure, where UMAP reduced centroid distance for all original centroid distances (it is unclear why, and likely due to chance).
Figure 7 – Each dot represents a simulated dataset of 1000 observations and 15 features, each with
different within-feature effect size, proportion of different features between clusters, covariance
structure, and number of clusters. The x-axis represents the separation between subgroups in the
population that datasets were simulated from, and the y-axis represents the separation in the
simulated dataset after no dimensionality reduction (green), multi-dimensional scaling (MDS,
blue), or uniform manifold approximation and projection (purple). The dotted line indicates no
difference between population and sample; values above the dotted line have had their separation increased, and it was decreased for those below the line.
was likely to increase separation, and UMAP was likely to increase it dramatically when the
original separation was large enough. Due to the stochastic nature of the algorithms, the size of their effects on separation is variable.
Statistical power and accuracy
We opted to not present the outcomes of agglomerative clustering, because they align closely with
k-means. In addition, because the previous analyses showed that subgroup centroid separation was
the main factor in determining cluster analysis outcomes, we opted for simulating datasets in
reduced space (multivariate normal with two features).
being a clustered structure to the data, which occurred when silhouette scores were 0.5 or over. The
probability of selecting the correct number of clusters was computed as the proportion of solutions
where the silhouette score was highest for the true number of clusters. Finally, classification
accuracy was computed as the proportion of observations that was correctly assigned to their
respective cluster.
K-means
For k-means, power to detect clustering was primarily dependent on cluster separation, and much
less on sample size (Table 2; Figure 8, top and second row). At cluster separation Δ=5, there was
71% power to detect clustering in a population divided into one large (90%) and one small (10%)
subgroup at sample size N=10, and 92% at N=20. For two equally sized clusters, power was 82%
from separation Δ=4 at N=10, and higher for larger sample and effect sizes. For three equally sized
clusters, power was 76% at separation Δ=4 for N=10, 69% for N=20, 77% to N=40, and over 80
from N=80; with power for larger effect sizes around 100%. For four equally sized clusters, power
was 75% at separation Δ=4 at N=80, 85% at Δ=5 for N=10, and around 100% for larger effect and
sample sizes. (See Figure 8 and Table 2 for all quoted numbers.)
of clusters from separation Δ=4. For equally sized clusters, that level of accuracy was also reached at separation Δ=3 when the sample size was about 20 per subgroup (Figure 8, third row).
chance for all tested separation values and sample sizes in populations with equally sized
subgroups, and above 80-90% from separation Δ=3. Classification accuracy was above chance for populations with one small (10%) and one large (90%) subgroup from N=80 and separation Δ=4 (Figure 8, bottom row).
subgroups with k-means, provided cluster separation was Δ=4 or over, and subgroups were roughly equally sized (detecting smaller subgroups among large subgroups was only possible for separations
of Δ=5 or over). These values also provided near-perfect accuracy for the detection of the true
number of clusters, and very high (90-100%) classification accuracy of individual observation’s
group membership.
Figure 8 – K-means silhouette scores (top row), proportion of correctly identified clustering
(second row), proportion of correctly identified number of clusters (third row), and the proportion
of observations correctly assigned to their subgroup, each computed through 100 iterations of
simulation. Datasets of varying sample size (x-axis) and two features were sampled from
populations with equidistant subgroups that each had the same 3-factor covariance structure. The
simulated populations were made up of two unequally sized (10 and 90%) subgroups (left column); or two, three, or four equally sized subgroups (second, third, and fourth column, respectively).
Table 2
Statistical power for the binary decision of data being “clustered” using k-means clustering.
Estimates based on 100 iterations per cell, using a decision threshold of 0.5 for silhouette scores.
2 Clusters
(10/90%)
2 Clusters
(50/50%)
3 Clusters
(33/34/33%)
4 Clusters
(25/25/25/25%)
HDBSCAN
For HDBSCAN, power to detect clustering was primarily dependent on effect size, provided the
sample size was over a threshold (Table 3; Figure 9, top and second row). For a population divided into one large (90%) and one small (10%) cluster, power was 84% at N=80 for separation Δ=6. For two clusters of equal size, power was 66% at N=40 and 83% for N=80 for separation Δ=3. For three
clusters of equal size, power was 66% at N=160 for separation Δ=3, and 84% at N=80 for
separation Δ=4. For four clusters of equal size, power was 75% at N=80 for separation Δ=4. (See
Figure 9 and Table 3 for all quoted numbers.)
that HDBSCAN detected the correct number of subgroups. This, too, was strongly dependent on
cluster separation. Accuracy was highest for separations of Δ=5 or over, and mostly acceptable
(around 70-80%) at separations Δ=4. For equally sized clusters, this was true from N=40, while the accurate detection of one small (10%) and one large (90%) subgroup required N=80 or over.
only over chance at from separations of Δ=8 and sample sizes of N=80 for populations with one
small (10%) and one large (90%) subgroup (chance = 90%). It was over chance from Δ=4 and
N=40 for populations with two equally sized subgroups (chance = 50%), from Δ=4 and N=40 for
populations with three equally sized subgroups (chance = 33%), and from Δ=3 and N=40 for
populations with four equally sized subgroups (chance = 25%).
of subgroups with HDBSCAN, provided cluster separation was Δ=4 or over, and subgroups were
roughly equally sized (detecting smaller subgroups among large subgroups was only possible for
separations of Δ=6 or over). These values also provided reasonable accuracy for the detection of the true number of clusters, as well as over-chance classification accuracy of individual observation’s group membership.
Figure 9 – HDBSCAN silhouette scores (top row), proportion of correctly identified clustering
(second row), proportion of correctly identified number of clusters (third row), and the proportion
of observations correctly assigned to their subgroup, each computed through 100 iterations of
simulation. Datasets of varying sample size (x-axis) and two features were sampled from
populations with equidistant subgroups that each had the same 3-factor covariance structure. The simulated populations were made up of two unequally sized (10 and 90%) subgroups (left column); or two, three, or four equally sized subgroups (second, third, and fourth column, respectively).
Table 3
Statistical power for the binary decision of data being “clustered” using HDBSCAN clustering.
Estimates based on 100 iterations per cell, using a decision threshold of 0.5 for silhouette scores.
2 Clusters
(10/90%)
2 Clusters
(50/50%)
3 Clusters
(33/34/33%)
4 Clusters
(25/25/25/25%)
C-means
As for k-means and HDBSCAN, for c-means power to detect clustering was primarily dependent on
cluster separation, and much less on sample size (Table 4; Figure 10, top and second row). At
cluster separation Δ=5, there was 81% power to detect clustering in a population divided into one large (90%) and one small (10%) subgroup at sample size N=20, and 95-100% at larger effect and
sample sizes. For two equally sized clusters, power was 77% for N=10 and 82% and N=20 for
and effect sizes. For four equally sized clusters, power was 75% for N=40 and 83% at N=80 at
separation Δ=3, 94% for N=20 at Δ=4, and around 100% for larger effect and sample sizes. (See
Figure 10 and Table 4 for all quoted numbers.)
of clusters from separation Δ=4. For equally sized clusters, that level of accuracy was also reached at separation Δ=3 when the sample size was about 20 per subgroup (Figure 10, third row).
chance for all tested separation values and sample sizes in populations with equally sized
subgroups, and above 80-90% from separation Δ=3. Classification accuracy was above chance for populations with one small (10%) and one large (90%) subgroup from N=40 and separation Δ=5 (Figure 10, bottom row).
subgroups with c-means, provided cluster separation was Δ=3 or over, and subgroups were roughly equally sized (detecting smaller subgroups among large subgroups was only possible for separations
of Δ=5 or over). These values also provided near-perfect accuracy for the detection of the true
number of clusters, and very high (90-100%) classification accuracy of individual observation’s
group membership.
Figure 10 – C-means silhouette scores (top row), proportion of correctly identified clustering
(second row), proportion of correctly identified number of clusters (third row), and the proportion
of observations correctly assigned to their subgroup, each computed through 100 iterations of
simulation. Datasets of varying sample size (x-axis) and two features were sampled from
populations with equidistant subgroups that each had the same 3-factor covariance structure. The
simulated populations were made up of two unequally sized (10 and 90%) subgroups (left column); or two, three, or four equally sized subgroups (second, third, and fourth column, respectively).
Table 4
Statistical power for the binary decision of data being “clustered” using k-means clustering.
Estimates based on 100 iterations per cell, using a decision threshold of 0.5 for silhouette scores.
2 Clusters
(10/90%)
2 Clusters
(50/50%)
3 Clusters
(33/34/33%)
4 Clusters
(25/25/25/25%)
Direct comparison of discrete and fuzzy clustering
The power results summarised above suggested c-means (80-100% power at Δ=3) is more powerful than k-means (70-100% power at Δ=4) for detecting equally sized clusters. K-means outcomes were
interpreted using the traditional silhouette score, whereas c-means outcomes were using the fuzzy
silhouette score. It could be that the latter inflated silhouette scores, which would make c-means
more likely to detect clustering at lower centroid separations (see Figure 11 for an example between Δ=3 and 4). However, such an inflation could have also increased the likelihood of false positives.
equal size (k=2 to 4) with a range of centroid separations (Δ=1 to 10). In addition to c-means, we
also employed Gaussian mixture modelling. This is a different approach than c-means clustering,
but it does result in a similar metric of membership confidence for each observation. This was used to compute the fuzzy silhouette coefficient in the same way as for c-means.
Crucially, the false positive rate (the probability of silhouette scores to surpass the 0.5 threshold
when no clustering was present) was 0% for k-means, c-means, and Gaussian mixture modelling
(Figure 12, left column; based on 100 iterations, each algorithm used on the same simulated data).
In sum, using non-discrete methods and the fuzzy cluster coefficient increased the likelihood of
cluster detection, but not the false positive rate.
score of 0.5 of higher) and cluster number accuracy (proportion of iterations in which the ground
truth number of clusters was correctly identified) as a function of cluster separation (effect size) in
simulated data. It confirms that c-means and Gaussian mixture modelling (100% power at Δ=3) are
more powerful than k-means (100% at Δ=4). C-means and mixture modelling also succeeded at
identifying the correct number of clusters at lower separations than k-means. Finally, estimating the number of clusters on the basis of the highest silhouette score was a better method than choosing on the basis of the lowest Bayesian Information Criterion (BIC) with Gaussian mixture modelling.
Figure 11 – Clustering outcomes of discrete k-means clustering (top row) and fuzzy c-means
clustering (bottom row). The top-left panel shows the ground truth, a sample (N=600) from a
simulated population made up of three equally sized and equidistant subgroups. The middle column shows the assignment of observations to clusters, and the right column shows the corresponding silhouette coefficients. The shading in the bottom silhouette plot indicates the discrete clustering
silhouette coefficients, and the coloured bars indicate a transformation equivalent to that
performed to compute the fuzzy silhouette score. The bottom-left shows a bunny (also fuzzy) amid contained and well-separated ellipsoidal clusters (photo by Tim Reckmann, licensed CC-BY 2.0).
Figure 12 – Traditional (top row), fuzzy (middle row and bottom row) silhouette scores,
respectively computed after clustering through k-means, c-means, and Gaussian finite mixture
modelling. Each line presents the mean and 95% confidence interval obtained over 100 iterations
of sampling (N=120, two uncorrelated features) from populations with no subgroups (left column),
two subgroups (second column), three subgroups (third column), or four subgroups (right column)
that were equidistant and of equal size. The same dataset in each iteration was subjected to k-
means, c-means, and Gaussian mixture modelling. Estimates were obtained for different centroid
separation values (Δ=1 to 10, in steps of 0.5; brighter colours indicate stronger separation). All
simulation results are plotted, and power (proportion of iterations that found k>1) and accuracy
(proportion of iterations that found the true number of clusters) are annotated for centroid
separations Δ=2 to 4 (see Figure 13 for power and accuracy as a function of centroid separation).
Figure 13 – Power (top row) and cluster number accuracy (bottom row) for k-means (left column), c-means (middle column), and Gaussian mixture modelling (right column). Power was computed as
the proportion of simulations in which the silhouette score was equal to or exceeded 0.5, the
threshold for subgroups being present in the data. Cluster number accuracy was computed as the
proportion of simulations in which the highest silhouette coefficient (solid lines) or the lowest
Bayesian Information Criterion (dashed line, only applicable in Gaussian mixture modelling) was
associated with the true simulated number of clusters. For each ground truth k=1 to k=4, 100
simulation iterations were run. In each iteration, all three methods were employed on the same
data, and all methods were run for guesses k=2 to k=7.
Ensuring adequate statistical power is essential to improve reliability and replicability of science
(27–29). Furthermore, the decision of whether subgroups exist in data can have important
theoretical and clinical consequences, for example when cluster analysis is used as a data-driven
approach to define diagnostic subgroups (35), or grouping patients in clinical practice (36). We
employed a simulation approach to determining statistical power for cluster analyses, i.e. the
probability of correctly accepting the hypothesis that subgroups exist in multivariate data. We
simulated subgroups as multivariate normal distributions that varied in number, relative size,
separation, and covariance structure. We also varied dimensionality reduction technique (none,
multi-dimensional scaling, and uniform manifold approximation and projection) and cluster-
detection algorithm (k-means, agglomerative hierarchical clustering, HDBSCAN, c-means, and
Gaussian mixture modelling).
reduction through multi-dimensional scaling (MDS) increases subgroup separation by about Δ=1,
whereas uniform manifold approximation and projection (UMAP) decreases separation when it was below Δ=4 in original space, or increase it when original separation was over Δ=5.
size and relative subgroup size had some effect. In populations with a small subgroup (10%), larger
separation was required to properly detect clustering. Furthermore, most algorithms performed
optimally at lower separations only with a minimum sample size of 20-30 observations per
subgroup, and thus a total sample size of that multiplied by the number of expected subgroups (k)
within a studied population.
traditional discrete methods. Specifically, c-means had sufficient power (80-100%) from a centroid separation of Δ=3, whereas that was only true from Δ=4 for k-means.
modelling. This is noteworthy, because latent profile analysis and latent class analysis (both
branches of finite mixture modelling) are increasingly popular among researchers (Figure 1). Our
results suggest that both fuzzy clustering and finite mixture modelling approaches are more
powerful than discrete clustering approaches.
Take-home messages
Practical example
This example summarises the above take-home measures in a hypothetical study. It could serve as
an example for a preregistration (or any other type of study planning or proposal).
sized subgroups. Between these subgroups, you expect small differences (Cohen’s d = 0.3) in 20
features, medium differences (Cohen’s d = 0.5) in 12, and large differences (Cohen’s d = 0.8) in 4.
With these differences among 36/100 measured features, total expected separation would be Δ=2.7 (computed with Equation 1).
be expected to increase to roughly Δ=4 in reduced (two-dimensional) space (according to Figure 7). This increased separation offers better power for a following cluster analysis.
algorithm (e.g. c-means) or a mixture modelling approach (e.g. latent profile or latent class analysis, depending on whether your data is continuous or binary), which both showed sufficient power at the expected separation of Δ=4 (according to Table 4, and Figures 10, 12, and 13).
power if you measure 20 to 30 observations per expected subgroup (according to Table 4 and Figure 10). You expected two equally sized subgroups, so your total sample size would be N=40 to N=60.
population, this would need to be accounted for in your sample size. Your sample should include at least 20 to 30 observations from the smaller subgroup. Assuming unbiased sampling (20% of your sample is part of the smaller subgroup in the population), your total sample should thus be N=100 to N=150. You can compute this as: (100% / subgroup percentage) * minimum number of samples.
Conclusion
Cluster algorithms have sufficient statistical power to detect subgroups (multivariate normal
distributions with different centres) only when these are sufficiently separated. Specifically, the
separation in standardised space (here named effect size Δ) should be at least 4 for k-means and
HDBSCAN to achieve over 80% power. Better power is observed for c-means and finite Gaussian mixture modelling, which achieved 80% power at Δ=3 without inflating the false positive rate.
Effect size Δ can be computed as the accumulation of expected within-feature effect sizes. While
covariance structure did not impact clustering, sample size did to some extent, particularly for
HDBSCAN and k-means. Sampling at least N=20 to 30 observations per expected subgroup
resulted in satisfactory statistical power.
Literature frequency comparison
Figure 1 in the Background section describes an increase in the relative frequency of publications
that mention several terms related to cluster analysis, compared to publications that mention
traditional statistics. The illustration was generated using Bibliobanana (17), which uses the NCBI
E-utilities API to query the publications that appeared in the PubMed database each year.
were “k-means” (k-means, blue line in Figure 1), “agglomerative clustering” OR “hierarchical
clustering” (hierarchical clustering, green line), “DBSCAN” OR “HDBSCAN” ((H)DBSCAN,
purple line), “c-means” (c-means, red line), "finite mixture modelling" OR "latent class analysis"
OR "latent profile analysis" (finite mixture modelling, orange line), and "t-SNE" OR "tSNE" OR
"UMAP" (t-SNE / UMAP, brown line). Comparison terms were “t-test”, “ANOVA”, and “linear
regression”. Clustering terms were divided by the average of all comparison terms to come to their relative frequency.
Simulation
All simulated datasets constituted multivariate normal distributions, specifically one for each
subgroup. We could define the covariance structure of each distribution, and distribution separation by setting their mean vectors. All standard deviations were set to 1.
features, with defined mean vectors (see below), and standard deviations of 1 within each feature.
The number of subgroups was 2 with unequal group size (10/90%), 2 with equal group size
(50/50%), or 3 with equal group size (33/34/33%). Within each simulation, within-feature
differences were generated with Cohen’s d values of 0.3, 0.5, 0.8, 1.3, or 2.1, and the number of
different features was 1, 5, 10, or 15.
intended Cohen’s d from one cluster, and adding the same value to the other within each feature. The order of addition and subtraction was shuffled within each feature. For three-cluster datasets, one cluster was assigned a mean vector of zeros (the “middle” cluster). The other two clusters had the intended Cohen’s d added or subtracted within each feature, again with shuffled order within each feature.
for all non-diagonal relations; uniform random values between -0.3 and 0.3; or with imposed 3 or 4 factor structure with uniform random values between -0.9 and -0.4 or between 0.4 and 0.9 within each factor, and between -0.3 and 0.3 for relations between variables in different factors. We also
included two types of datasets with different covariance structures between subgroups, one with
different underlying factor structures (3 and 4 factors; or 3 and 4 and no factors), and one simply
with different covariance structures (random and random; or random and random and no
covariance).
of 10, 20, 40, 80, or 160 observations and 2 (uncorrelated) features. These datasets were constructed
with two unequally sized (10/90%), two equally sized, three equally sized, or four equally sized
subgroups (multivariate normal distributions with standard deviations of one); all with equidistant
means at centroid separations of Δ=1 to 10. For each combination of variables, new data were
generated in 100 iterations, and then analysed with k-means, HDBSCAN, or c-means (see below).
modelling, we simulated datasets of 120 observations and 2 (uncorrelated) features with one, two,
three, or four equally sized and equidistant multivariate normal distributions. Their separations
varied from Δ=1 to 10, and they were analysed with k-means, c-means, and Gaussian mixture
modelling (see below).
Open Code and Data
Data was simulated in Python (37) (version 2.7.12; for a tutorial, see reference (38)), using the
NumPy package (version 1.16.5) (39,40). Dimensionality reduction and clustering was performed
using the packages SciPy (version 1.2.2) (39), umap-learn (version 0.2.1) (31,41), hdbscan (version
0.8.12) (34), scikit-fuzzy (version 0.4.2), and scikit-learn (version 0.20.4) (2). Outcomes were
plotted using Matplotlib (version 2.1.2) (42). All code and simulated data used for this manuscript
can be found on GitHub, from where it can be freely accessed and downloaded:
www.github.com/esdalmaijer/cluster_power
here, and can be altered to test additional types of dimensionality reduction and clustering
algorithms. We have already implemented 13 further dimensionality reduction algorithms, and 8
additional cluster algorithms. Researchers with a special interest in any of these are welcome to use our resource to compute statistical power for their specific situation.
Dimensionality reduction
Cluster analysis is usually performed on high-dimensional data, i.e. with many measured features
per observation. While it is possible to apply clustering algorithms directly, the “curse of
dimensionality” entails that this approach is unlikely to yield strong results (26). Instead, many opt for projecting high-dimensional data into a lower-dimensional space. One option for this is principal
component analysis (PCA), but extracting only a few components risks removing meaningful
variance. Instead, data can be projected in two-dimensional space with limited loss of information
with multi-dimensional scaling (30) (MDS), a technique that aims to retain inter-observation
distances in original data in a lower-dimensional projection. Finally, algorithms such as t-stochastic neighbour embedding (t-SNE) (43) and uniform manifold approximation and projection (UMAP)
(31) non-linearly reduce dimensionality, effectively retaining local inter-sample distances while
exaggerating global distances. An additional advantage of these techniques is that data projected into two or three dimensions can be plotted, and thus visually inspected for oddities, and perhaps even provide a rough indication of grouping.
Clustering
After dimensionality reduction, the resulting dataset can be subjected to a wide selection of
clustering algorithms that each have optimal conditions (for an overview, see (44)). Here, we will explore the most common types. This includes k-means (32), an algorithm that arbitrarily draws a predefined number (k) of centroids within the data, and on each iteration moves the centroids to the average of the observations that are closest to each centroid, until a stable solution is reached.
Another approach is agglomerative (hierarchical) clustering, which recursively joins pairs of
observations according to a combination of linkage affinity (e.g. Euclidean or cosine distance) and criterion. A commonly used linkage is Ward, which minimises the variance of merging groups of
observations (33). Because these algorithms require the user to define the number of clusters, a
common approach is to cycle through a variety of options to identify the best fitting solution.
clusters includes DBSCAN (45) and HDBSCAN (34). They identify clusters of denser observations among lower-density observations that remain unassigned.
Euclidean distance, agglomerative clustering with average linkage and cosine distance, HDBSCAN, and c-means (fuzzy clustering; see below).
Outcome evaluation
After observations are assigned a cluster, the quality of the solution can be determined. For each
sample, a silhouette coefficient can be computed as the relative distance to its assigned centroid and
the nearest other centroid (46). For each observation, a value of 1 means perfect alignment with its
assigned centroid, 0 means it lies exactly in between its centroid and the nearest other, and -1 means
perfect alignment with a centroid it was not assigned to. The average across all assigned
observations is the silhouette score, which is often taken as evidence for clustering if it exceeds 0.5, or as strong evidence if it exceeds 0.7 (47). It should be noted that there are many cluster validation
indices (for excellent overviews, see (19,48)). We focus on the silhouette score because of its good
performance in many circumstances (19), conceptual elegance, and established thresholds for
interpretation. Unassigned observations (such as in HDBSCAN) are ignored for silhouette score
computation. Scores were computed slightly differently for fuzzy clustering tools (see section
“Fuzzy Clustering” below), but interpreted in the same framework.
allowed us to compute the Rand index (49), adjusted for chance (50), to quantify the overlap
between cluster outcome and ground truth. An adjusted Rand index of 1 reflects perfect match, a
value of 0 means chance performance, and negative values indicate the clustering performed worse
than chance. While the adjusted Rand index quantifies the overlap between cluster outcome and
truth, the silhouette coefficient reflects what an experimenter (who is normally blind to the ground
truth) would conclude.
Table 5
Outcome evaluation metrics and their meanings.
0.5 – 0.7 Evidence for clustering
0.7 – 1.0 Strong evidence for clustering
Fuzzy clustering and mixture modelling
We employed the c-means algorithm (51,52), specifically the version implemented in Python
package scikit-fuzzy (53). It converges on centroids in a similar way to k-means, but allows for
observations to be assigned to more than one cluster. Specifically, each observation is assigned k values between 0 and 1 that indicate membership likelihood.
GaussianMixture class (2). This approach aims to find the best mixture of k Gaussian distributions,
allowing each component their own general covariance matrix. The resulting model was used to
compute for each observation the probability that it was part of each Gaussian; a similar outcome to the c-means algorithm.
coefficient intended for fuzzy clustering methods (54). This silhouette score has a tunable
exponentiation parameter α that determines how strongly the uncertainty about each observations
cluster membership is weighted (when it approaches 0, the fuzzy silhouette coefficient approaches
the regular version), which was set to 1 in our analyses.
fitted solution. Where the silhouette score should be maximised to identify the best cluster solution, the BIC should be minimised.
normal distributions with SD=1) with separations of Δ=1 to 10. A new dataset was simulated in 100
iterations. In each, k-means and c-means were applied, with predefined guesses of k=2 to 7. From
the outcomes, we computed the probability of each analysis to detect clustering (silhouette
coefficient >= 0.5), and to detect the correct number of clusters (silhouette coefficient highest for
the value of k that corresponded with ground truth).
Effect of dimensionality reduction on cluster separation
In the simulated datasets, distances between cluster centroids in original feature space should be
Euclidean (Equation 1). However, due to the stochastic and non-linear nature of dimensionality
reduction algorithms, centroid distances after dimensionality reduction are less predictable. To
quantify the effect of dimensionality reduction on cluster separation, we computed the distance
between cluster centroids (defined as the average Euclidean position of observations within a
cluster) in projected space after dimensionality reduction.
(1) 1
Where Δ is centroid distance, n is the number of features, and δ is the within-feature difference
between clusters (effect size).
Introducing effect size Δ
We consider multivariate normal distributions with standard deviations of 1 (for all features) to be standardised space, and refer to cluster separation in this space as Δ. It can serve as an effect size metric for clustering in the sense that it reflects the extent of separation of simulated or identified subgroups. It is essentially the multivariate equivalent of Cohen’s d, and can in fact be estimated from expected values of Cohen’s d within each feature via Equation 1.
Power and accuracy
Researchers who opt for cluster analysis are likely attempting to answer three main questions: 1) Are subgroups present in my data, 2) How many subgroups are present in my data, and 3) Which
observations belong to what subgroup?
rejected if an alternative hypothesis is true. Various approaches have been suggested to define
statistical power in cluster analyses, for example through outcome permutation (55) or through
measures of subgroup overlap (56). Here, we define power as the likelihood of a cluster analysis to accurately reject the null hypothesis that no subgroups are present, based on the binary decision of a solution’s silhouette score being 0.5 or over (47).
probability of cluster analyses to identify the correct number of subgroups in simulated datasets. This was done by cycling through k=2 to k=5 for algorithms that require pre-specification of cluster number (k-means and agglomerative clustering), and choosing the value that resulted in the highest silhouette coefficient. HDBSCAN reports the number of detected clusters, and thus did not require iterating through pre-specified values.
assigned to their respective clusters. This reflects the overlap between ground truth and assigned
cluster membership. This accuracy is dependent on chance, which was set at the proportion of the
total sample size that was in the largest cluster.
Ethics approval and consent to participate
No participants were tested for this study, which relies on simulations only.
Availability of data and materials
The authors have made their data and analysis software available publicly. It can be accessed on
Competing interests
The authors declare that they have no competing interests.
Funding
ESD and DEA are supported by grant TWCF0159 from the Templeton World Charity Foundation to
DEA. CLN is supported by an AXA Fellowship. All authors are supported by the UK Medical
Research Council, grant MC-A0606-5PQ41.
Authors’ contributions
ESD and CLN first initiated the study. ESD conceptualised the study, developed the methods,
simulated and analysed the data, and drafted the manuscript. All authors interpreted the results,
provided critical feedback on the manuscript, and approved of the final version.
1. Handelsman DJ, Teede HJ, Desai R, Norman RJ, Moran LJ. Performance of mass spectrometry steroid profiling for diagnosis of polycystic ovary syndrome. Hum Reprod. 2017 Feb;32(2):418–22.
2. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
3. Hennig C. fpc [Internet]. 2020. Available from: https://cran.rproject.org/web/packages/fpc/index.html
4. Anjana RM, Baskar V, Nair ATN, Jebarani S, Siddiqui MK, Pradeepa R, et al. Novel subgroups of type 2 diabetes and their association with microvascular outcomes in an Asian Indian population: a data-driven cluster analysis: the INSPIRED study. BMJ Open Diabetes Res Care. 2020 Aug;8(1):e001506.
5. Tao R, Yu X, Lu J, Shen Y, Lu W, Zhu W, et al. Multilevel clustering approach driven by continuous glucose monitoring data for further classification of type 2 diabetes. BMJ Open Diabetes Res Care. 2021 Feb;9(1):e001869.
6. Carrillo-Larco RM, Castillo-Cara M, Anza-Ramirez C, Bernabé-Ortiz A. Clusters of people with type 2 diabetes in the general population: unsupervised machine learning approach using national surveys in Latin America and the Caribbean. BMJ Open Diabetes Res Care. 2021 Jan;9(1):e001889.
7. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018 May;6(5):361–9.
8. Jonsson PF, Cavanna T, Zicha D, Bates PA. Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinformatics. 2006;7(1):2.
9. De La Monte SM, Moore WM, Hutchins GM. Metastatic behavior of prostate cancer: Cluster analysis of patterns with respect to estrogen treatment. Cancer. 1986;58(4):985–93.
10. Lawton M, Ben-Shlomo Y, May MT, Baig F, Barber TR, Klein JC, et al. Developing and validating Parkinson’s disease subtypes and their motor and cognitive progression. J Neurol Neurosurg Psychiatry. 2018 Dec;89(12):1279–87.
11. Bathelt J, Johnson A, Zhang M, the CALM team, Astle DE. Data-driven brain-types and their cognitive consequences [Internet]. Neuroscience; 2017 Dec [cited 2020 Feb 20]. Available from: http://biorxiv.org/lookup/doi/10.1101/237859
12. Astle DE, Bathelt J, The CALM Team, Holmes J. Remapping the cognitive and neural profiles of children who struggle at school. Dev Sci. 2019 Jan;22(1):e12747.
13. Bathelt J, Holmes J, Astle DE, The CALM Team. Data-Driven Subtyping of Executive Function–Related Behavioral Problems in Children. J Am Acad Child Adolesc Psychiatry. 2018 Apr;57(4):252-262.e4.
14. Benjamins JS, Dalmaijer ES, Ten Brink AF, Nijboer TCW, Van der Stigchel S. Multi-target visual search organisation across the lifespan: cancellation task performance in a large and demographically stratified sample of healthy adults. Aging Neuropsychol Cogn. 2019;26(5):731–48.
15. Rennie JP, Zhang M, Hawkins E, Bathelt J, Astle DE. Mapping differential responses to cognitive training using machine learning. Dev Sci [Internet]. 2019 Jul 22 [cited 2019 Jul 30]; Available from: https://onlinelibrary.wiley.com/doi/abs/10.1111/desc.12868
16. Uh S, Dalmaijer ES, Siugzdaite R, Ford TJ, Astle DE. Two Pathways to Self-Harm in Adolescence. J Am Acad Child Adolesc Psychiatry. 2021;S0890856721002197.
17. Dalmaijer ES, Van Rheede J, Sperr EV, Tkotz J. Banana for scale: Gauging trends in academic interest by normalising publication rates to common and innocuous keywords. ArXiv210206418 Cs [Internet]. 2021 Feb 12 [cited 2021 Apr 7]; Available from: http://arxiv.org/abs/2102.06418
18. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936 Sep;7(2):179–88.
19. Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013 Jan;46(1):243–56.
20. Dubes RC. How many clusters are best? - An experiment. Pattern Recognit. 1987;20(6):645– 63.
21. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol. 2001;63(2):411–23.
22. Hennig C. What are the true clusters? Pattern Recognit Lett. 2015 Oct;64:53–62.
23. Franco M, Vivo J-M. Cluster Analysis of Microarray Data. In: Bolón-Canedo V, AlonsoBetanzos A, editors. Microarray Bioinformatics [Internet]. New York, NY: Springer New York; 2019 [cited 2021 May 23]. p. 153–83. Available from: http://link.springer.com/10.1007/978-1-4939-9442-7_7
24. Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005 Aug 1;21(15):3201–12.
25. Ronan T, Qi Z, Naegle KM. Avoiding common pitfalls when clustering biological data. Sci Signal. 2016 Jun 14;9(432):re6–re6.
26. Bellman R. Dynamic programming. Princeton: Princeton University Press; 1957.
27. Ioannidis JPA. Why Most Published Research Findings Are False. PLoS Med. 2005 Aug 30;2(8):e124.
28. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013 Apr 10;14(5):365–76.
29. Nord CL, Valton V, Wood J, Roiser JP. Power-up: A Reanalysis of “Power Failure” in Neuroscience Using Mixture Modeling. J Neurosci. 2017 Aug 23;37(34):8051–61.
30. Kruskal J. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964;29(1):1–27.
31. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv180203426 Cs Stat [Internet]. 2018 Dec 6 [cited 2020 Feb 20]; Available from: http://arxiv.org/abs/1802.03426
32. Lloyd SP. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–37.
33. Ward JH. Hierarchical Grouping to Optimize an Objective Function. J Am Stat Assoc. 1963;58(301):236–44.
34. McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. J Open Source Softw. 2017 Mar 21;2(11):205.
35. van Loo HM, de Jonge P, Romeijn J-W, Kessler RC, Schoevers RA. Data-driven subtypes of major depressive disorder: a systematic review. BMC Med. 2012 Dec;10(1):156.
36. Menger V, Spruit M, Hagoort K, Scheepers F. Transitioning to a Data Driven Mental Health Practice: Collaborative Expert Sessions for Knowledge and Hypothesis Finding. Comput Math Methods Med. 2016;2016:1–11.
37. Van Rossum G, Drake FL. Python Language reference manual. Bristol, UK: Network Theory Ltd.; 2011.
38. Dalmaijer ES. Python for experimental psychologists. Abingdon, Oxon; New York, NY: Routledge; 2017.
39. Oliphant TE. Python for Scientific Computing. Comput Sci Eng. 2007;9(3):10–20.
40. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020 Sep 17;585(7825):357–62.
41. McInnes L, Healy J, Saul N, Großberger L. UMAP: Uniform Manifold Approximation and Projection. J Open Source Softw. 2018 Sep 2;3(29):861.
42. Hunter JD. Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007;9(3):90–5.
43. Van der Maaten LJP, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008;9:2579–605.
44. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv CSUR. 1999 Sep;31(3):264–323.
45. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD. 1996;96(34):226–31.
46. Rousseeuw P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987 Nov;20:53–65.
47. Kaufman L, Rousseeuw PJ, editors. Finding Groups in Data [Internet]. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 1990 [cited 2018 Mar 4]. (Wiley Series in Probability and Statistics). Available from: http://doi.wiley.com/10.1002/9780470316801
48. Vendramin L, Campello RJGB, Hruschka ER. Relative clustering validity criteria: A comparative overview. Stat Anal Data Min. 2010 Jun 30;n/a-n/a.
49. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50.
50. Hubert L, Arabie P. Comparing partitions. J Classif. 1985 Dec;2(1):193–218.
51. Bezdek JC. Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press; 1981. 256 p. (Advanced applications in pattern recognition).
52. Dunn JC. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact WellSeparated Clusters. J Cybern. 1973 Jan;3(3):32–57.
53. Ross TJ. Chapter 10: Fuzzy Classification (subheading: Fuzzy c-Means Algorithm). In: Fuzzy logic with engineering applications. 3rd ed. Wiley; 2010. p. 352–3.
54. Campello RJGB, Hruschka ER. A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst. 2006 Nov;157(21):2858–75.
55. Baker FB, Hubert LJ. Measuring the Power of Hierarchical Cluster Analysis. J Am Stat Assoc. 1975;70(349):31–8.
56. Sneath PH. A method for testing the distinctness of clusters: a test of the disjunction of two clusters in Euclidean space as measured by their overlap. Math Geol. 1977;7(2):123–43.