A novel initialisation based on hospital-resident assignment for the k-modes algorithm

2020·Arxiv

Abstract

Abstract

This paper presents a new way of selecting an initial solution for the k-modes algorithm that allows for a notion of mathematical fairness and a leverage of the data that the common initialisations from literature do not. The method, which utilises the Hospital-Resident Assignment Problem to find the set of initial cluster centroids, is compared with the current initialisations on both benchmark datasets and a body of newly generated artificial datasets. Based on this analysis, the proposed method is shown to outperform the other initialisations in the majority of cases, especially when the number of clusters is optimised. In addition, we find that our method outperforms the leading established method specifically for low-density data.

1 Introduction

This work focusses on k-modes clustering — an extension to k-means that permits the sensible clustering of categorical (i.e. ordinal, nominal or otherwise discrete) data as set out in the seminal works by Huang [9, 10, 11]. In particular, the interest of this paper is in how the performance of the k-modes algorithm is affected by the quality of its initial solution. The initialisation method proposed in this work extends the method presented by Huang [11] by using results from game theory to ensure mathematical fairness and to lever the full learning opportunities presented by the data being clustered. In doing so, it is demonstrated that the proposed method is able to outperform both of the established initialisations for k-modes. The paper is structured as follows:

• Section 1 introduces the k-modes algorithm and its established initialisation methods.

• Section 2 provides a brief overview of matching games and their variants before a statement of the proposed initialisation.

• Section 3 presents analyses of the initialisations on benchmark and new, artificial datasets.

• Section 4 concludes the paper.

1.1 The k-modes algorithm

The following notation will be used throughout this work to describe the objects associated with clustering a categorical dataset:

• Let denote the attribute space. In this work, only categorical attributes are considered, i.e. for each j = 1, . . . , m it follows that

• Let where each is defined as an . The elements of X are referred to as data points or instances.

• Let ) be a partition of a dataset distinct, non-empty parts. Such a partition Z is called a clustering of X.

• Each cluster has associated with it a mode (see Definition 2) which is denoted by . These points are also referred to as representative points or centroids.

Definition 1 describes a dissimilarity measure between categorical data points.

be a dataset and consider any . The dissimilarity between , denoted by , is given by:

With this metric, the notion of a representative point of a cluster is addressed. With numeric data and k-means, such a point is taken to be the mean of the points within the cluster. With categorical data, however, the mode is used as the measure for central tendency. This follows from the concept of dissimilarity in that the point that best represents (i.e. is closest to) those in a cluster is one with the most frequent attribute values of the points in the cluster. The following definitions and theorem formalise this and a method to find such a point.

be a dataset and consider some point called a mode of X if it minimises the following:

be a dataset. Then denotes the frequency of the

, i.e. for each

Theorem 1. Consider a dataset minimised if and only if A proof of this theorem can be found in the Appendix of [11].

Theorem 1 defines the process by which cluster modes are updated in k-modes (see Algorithm 3), and so the final component from the k-means paradigm to be configured is the objective (cost) function. This function is defined in Definition 4, and following that a practical statement of the k-modes algorithm is given in Algorithm 1 as set out in [11].

be a clustering of a dataset the corresponding cluster modes. Then partition matrix of X such that:

1.2 Initialisation processes

The standard selection method to initialise k-modes is to randomly sample k distinct points in the dataset. In all cases, the initial modes must be points in the dataset to ensure that there are no empty clusters in the first iteration of the algorithm. The remainder of this section describes two well-established initialisation methods that aim to preemptively lever the structure of the data at hand.

1.2.1 Huang’s method

Amongst the original works by Huang, an alternative initialisation method was presented that selects modes by distributing frequently occurring values from the attribute space among k potential modes [11]. The process, denoted as Huang’s method, is described in full in Algorithm 4. Huang’s method considers a set of potential modes, , that is then replaced by the actual set of initial modes, . The statement of how the set of potential modes are formed is ambiguous in the original paper — as is alluded to in [13]. Here, as is done in practical implementations of k-modes, this has been interpreted as being done via a weighted random sample (see Algorithm 5).

1.2.2 Cao’s method

The second initialisation process that is widely used with k-modes is known as Cao’s method [3]. This method selects the initial modes according to their density in the dataset whilst forcing dissimilarity between them. Definition 5 formalises the concept of density and its relationship to relative frequency. The method, which is described in Algorithm 6, is deterministic — unlike Huang’s method which relies on random sampling.

Definition 5. Consider a dataset average density of any point

with respect to A is defined [3] as:

2 Matching games and the proposed method

Both of the initialisation methods described in Section 1.2 have a greedy component. Cao’s method essentially chooses the densest point that has not already been chosen whilst forcing separation between the set of initial modes. In the case of Huang’s, however, the greediness only comes at the end of the method, when the set of potential modes is replaced by a set of instances in the dataset. Specifically, this means that in any practical implementation of this method the order in which a set of potential modes is iterated over can affect the set of initial modes. Thus, there is no guarantee of consistency.

The initialisation proposed in this work extends Huang’s method to be order-invariant in the final allocation — thereby eliminating its greedy component — and provides a more intuitive starting point for the k-modes algorithm. This is done by constructing and solving a matching game between the set of potential modes and some subset of the data.

In general, matching games are defined by two sets (parties) of players in which each player creates a preference list of at least some of the players in the other party. The objective then is to find a ‘stable’ mapping between the two sets of players such that no pair of players is (rationally) unhappy with their matching. Algorithms to ‘solve’ — i.e. find stable matchings to — instances of matching games are often structured to be party-oriented and aim to maximise some form of social or party-based optimality [6, 7, 8, 12, 15, 16].

The particular constraints of this case — where the k potential modes must be allocated to a nearby unique data point — mirror those of the so-called Hospital-Resident Assignment Problem (HR). This problem gets its name from the real-world problem of fairly allocating medical students to hospital posts. A resident-optimal algorithm for solving HR was presented in [8] and was adapted in [20] to take advantage of the structure of the game. This adapted algorithm is given in Algorithm 7. A practical implementation of this algorithm has been implemented in Python as part of the matching library [24] and is used in the implementation of the proposed method for Section 3.

The game used to model HR, its matchings, and its notion of stability are defined in Defini-tions 6—8. A summary of these definitions in the context of the proposed k-modes initialisation is given in Table 1 before a formal statement of the proposed method in Algorithm 11.

Definition 6. Consider two distinct sets R, H and refer to them residents and hospitals. Each has a capacity associated with them. Each player has associated with it a strict preference list of the other set’s elements such that:

• Each ranks a non-empty subset of H, denoted by f(r).

• Each ranks all and only those residents that have ranked it, i.e. the preference list of h, denoted g(h), is a permutation of the set . If no such residents exist, h is removed from H.

This construction of residents, hospitals, capacities and preference lists is called a game and is denoted by (R, H).

Definition 7. Consider a game (R, H). A matching M is any mapping between R and H. If a pair (are matched in M then this relationship is denoted A matching is only considered valid if all of the following hold for all

• If r is matched then

• If h has at least one match then

• h is not over-subscribed, i.e.

Definition 8. Consider a game (R, H). Then a pair (is said to block a matching M if all of the following hold:

• There is mutual preference, i.e.

• Either r is unmatched or they prefer h to M(r).

• Either h is under-subscribed or h prefers r to at least one resident in

Table 1: A summary of the relationships between the components of the initialisation for k-modes and those in a matching game (R, H).

3 Experimental results

To give comparative results on the quality of the initialisation processes considered in this work, four well-known, categorical, labelled datasets — breast cancer, mushroom, nursery, and soybean (large) — will be clustered by the k-modes algorithm with each of the initialisation processes. These datasets have been chosen to fall in line with the established literature, and for their relative sizes and complexities. Each dataset is openly available under the UCI Machine Learning Repository [5], and their characteristics are summarised in Table 2. For the purposes of this analysis, incomplete instances (i.e. where data is missing) are excluded and the remaining dataset characteristics are reported as ‘adjusted’.

All of the source code used to produce the results and data in this analysis — including the datasets investigated in Section 3.3 — are archived at DOI 10.5281/zenodo.3639282. In addition to this, the implementation of the k-modes algorithm and its initialisations is available under DOI 10.5281/zenodo.3638035.

Table 2: A summary of the benchmark datasets.

This analysis does not consider evaluative metrics related to classification such as accuracy, recall or precision as is commonly done [1, 3, 4, 11, 18, 19, 22, 23]. Instead, only internal measures are considered such as the cost function defined in (4). This metric is label-invariant and its values are comparable across the different initialisation methods. Furthermore, the effect of each initialisation method on the initial and final clusterings can be captured with the cost function. An additional, and often useful, metric is the silhouette coefficient. This measures the ratio between the intra-cluster cohesion and inter-cluster separation of a particular clustering. Therefore, it could be used in a similar way to reveal the effect of each initialisation method at the beginning and end of a run of k-modes. Unfortunately, this metric loses its intuition under the distance measure employed here and is omitted. The remaining performance measures used are the number of iterations for the k-modes algorithm to terminate and the time taken to terminate in seconds.

The final piece of information required in this analysis is a choice for k for each dataset. An immediate choice is the number of classes that are present in a dataset but this is not necessarily an appropriate choice since the classes may not be representative of true clusters [17]. However, this analysis will consider this case as there may be practical reasons to limit the value of k. The other strategy for choosing k considered in this work uses the knee point detection algorithm introduced in [21]. This strategy was chosen over other popular methods such as the ‘elbow’ method as its results are definitive.

The knee point detection algorithm was employed over values of k from 2 up to dataset. The number of clusters determined by this strategy is reported in the final column of Table 2.

3.1 Using knee point detection algorithm for k

Tables 3—6 summarise the results of each initialisation method on the benchmark datasets where the number of clusters has been determined by the knee point detection algorithm. Each column shows the mean value of each metric and its standard 250 repetitions of the k-modes algorithm.

By examining these tables it would seem that the proposed method and Huang’s method are comparable across the board — although the proposed method is faster despite taking more iterations in general which may relate to a more intuitive initialisation. More importantly though, it appears that Cao’s method performs the best out of the three initialisation methods: in terms of initial and final costs Cao’s method improves, on average, by roughly 10 percent against the next best method for the three datasets that it succeeds with; the number of iterations is comparable; and the computation time is substantially less than the other two considering it is a deterministic method and need only be run once to achieve this performance.

However, in the k-means paradigm, a particular clustering is selected based on it having the minimum final cost over a number of runs of the algorithm — not the mean — and while Cao’s method is very reliable, in that there is no variation at all, it does not always produce the best clustering possible. There is a trade-off to be made between computational time and performance here. In order to gain more insight into the performance of each method, less granular analysis is required. Figures 1—4 display the cost function results for each dataset in the form of a scatter plot and two empirical cumulative density function (CDF) plots, highlighting the breadth and depth of the behaviours exhibited by each initialisation method.

Looking at Figure 1 it is clear that in terms of final cost Cao’s method is middling when compared to the other methods. This was apparent from Table 3 and, indeed, Huang’s and the proposed method are both very comparable when looking at the main body of the results. However,

Table 3: Summative metric results for the breast cancer dataset with k = 8.

Table 4: Summative metric results for the mushroom dataset with k = 17.

Table 5: Summative metric results for the nursery dataset with k = 23.

Table 6: Summative metric results for the soybean dataset with k = 8.

Figure 1: Summative plots for the breast cancer dataset with k = 8.

since the criterion for the best clustering (in practical terms) is having the minimum final cost, it is evident that the proposed method is superior; that the method produces clusterings with a larger cost range (indicated by the trailing right-hand side of each CDF plot) is irrelevant for the same reason.

This pattern of largely similar behaviour between Huang’s and the proposed method is apparent in each of the figures here, and in each case the proposed method outperforms Huang’s. In fact, in all cases except for the nursery dataset, the proposed method achieves the lowest final cost of all the methods and, as such, performs the best in practical terms on these particular datasets.

In the case of the nursery dataset, Cao’s method is unquestionably the best performing initialisation method. It should be noted that none of the methods were able to find an initial clustering that could be improved on, and that this dataset exactly describes the entire attribute space in which it exists. This property could be why the other methods fall behind Cao’s so decisively in that Cao’s method is able to definitively choose the k most dense-whilst-separated points from the attribute space as the initial cluster centres whereas the other two methods are in essence randomly sampling from this space. That each initial solution in these repetitions is locally optimal remains a mystery.

3.2 Using number of classes for k

As is discussed above, the often automatic choice for k is the number of classes present in the data; this subsection repeats the analysis from the subsection above but with this traditional choice for k. Tables 7—10 contain the analogous summaries of each initialisation method’s performance on the benchmark datasets over the same number of repetitions.

Figure 2: Summative plots for the mushroom dataset with k = 17.

Figure 3: Summative plots for the nursery dataset with k = 23.

Table 7: Summative metric results for the breast cancer dataset with k = 2.

Table 8: Summative metric results for the mushroom dataset with k = 2.

Table 9: Summative metric results for the nursery dataset with k = 5.

Table 10: Summative metric results for the soybean dataset with k = 15.

Figure 4: Summative plots for the soybean dataset with k = 8.

An immediate comparison to the previous tables is that for all datasets bar the soybean dataset, the mean costs are significantly higher and the computation times are lower. These effects come directly from the choice of k in that higher values of k will require more checks (and thus computational time) but will typically lead to more homogeneous clusters, reducing their within-cluster dissimilarity and therefore cost.

Looking at these tables on their own, Cao’s method is the superior initialisation method on average: the means are substantially lower in terms of initial and final cost; there is no deviation in these results; again, the total computational time is a fraction of the other two methods. It is also apparent that Huang’s method and the proposed extension are very comparable on average. As before, finer investigation will require finer visualisations. Figures 5—8 show the same plots as in the previous subsection except the number of clusters has been taken to be the number of classes present in each dataset.

Figures 5 & 6 indicate that a particular behaviour emerged during the runs of the k-modes algorithm. Specifically, each solution falls into one of (predominantly) two types: effectively no improvement on the initial clustering, or terminating at some clustering with a cost that is bounded below across all such solutions. Invariably, Cao’s method achieves or approaches this lower bound and unless Cao’s method is used, these particular choices for k mean that the performance of the k-modes algorithm is exceptionally sensitive to its initial clustering. Moreover, the other two methods are effectively indistinguishable in these cases and so if a robust solution is required, Cao’s method is the only viable option.

Figure 7 corresponds to the nursery dataset results with k = 5. In this set of runs, the same pattern emerges as in Figure 3 where sampling the initial centres from amongst the most dense points (via Huang’s method and the proposed) is an inferior strategy to one considering the entire

Figure 5: Summative plots for the breast cancer dataset with k = 2.

attribute space such as with Cao’s method. Again, no method is able to improve on the initial solution except for one repetition with the matching initialisation method.

The primary conclusion from this analysis is that while Huang’s method is largely comparable to the proposed extension, there is no substantial evidence from these use cases to use Huang’s method over the one proposed in this work. In fact, Figure 8 is the only instance where Huang’s method was able to outperform the proposed method. Other than this, the proposed method consistently performing better (or as well as) Huang’s method in terms of minimal final costs and computational time over a number of runs in both the cases where an external framework is imposed on the data (by choosing k to be the number of classes) and not. Furthermore, though not discussed in this work, the matching initialisation method has the scope to allow for expert or prior knowledge to be included in an initial clustering by using some ad hoc preference list mechanism.

3.3 Artificial datasets

Following on from the conclusions of the analysis thus far, the competition between Cao’s method and the proposed matching method may be studied more deeply. All of the results leading up to this point were conducted using benchmark datasets and while there are certainly benefits to comparing methods in this way, it does not afford a rich understanding of how any of them perform more generally. This stage of the analysis relies on a method for generating artificial datasets introduced in [25]. In essence, this method is an evolutionary algorithm which acts on entire datasets to explore the space in which potentially all possible datasets exist. The key component of this method is an objective function that takes a dataset and returns a value that is to be minimised; this function is referred to as the fitness function.

Figure 6: Summative plots for the mushroom dataset with k = 2.

Figure 7: Summative plots for the nursery dataset with k = 5.

Figure 8: Summative plots for the soybean dataset with k = 15.

In order to reveal the nuances in the performance of Cao’s method and the proposed initialisation on a particular dataset, two cases are considered: where Cao’s method outperforms the proposed, and vice versa. Both cases use the same fitness function — with the latter using its negative — which is defined as follows:

where are the final costs when a dataset X is clustered using Cao’s method and the proposed matching method respectively with k = 3. For the sake of computational time, the proposed initialisation was given 25 repetitions as opposed to the 250 repetitions in the remainder of this section. Apart from the sign of f, the dataset generation processes used identical parameters in each case and the datasets considered here are all of comparable shape.

This process yielded approximately 35,000 unique datasets for each case, and the ensuing analysis only considers the top-performing percentile of datasets from each. Figure 9 shows the fitness distribution of the top percentile in each case. It should be clear from (7) that large negative values are preferable here. With that, and bearing in mind that the generation of these datasets was parameterised in a consistent manner, it appears that the attempt to outperform Cao’s method proved somewhat easier. This is indicated by the substantial difference in the locations of the fitness distributions.

Given the quantity of data available, to understand the patterns that have emerged, they must be summarised; in this case, univariate statistics are used. Despite the datasets all being of similar shapes, there are some discrepancies. With the number of rows this is less of an issue but any comparison of statistics across datasets of different widths is difficult without prior knowledge of the datasets. Moreover, there is no guarantee of contingency amongst the attributes, and the

Figure 9: Histograms of fitness for the top performing percentile in each case.

comparison of more than a handful of variables becomes complicated even when the attributes are identifiable. To combat this and bring uniformity to the datasets, each dataset is represented as their first principal component obtained via centred Principal Component Analysis (PCA) [14]. While some subtleties may be lost, this representation captures the most important characteristics of each dataset in a single variable meaning they can be compared directly.

Since the transformation by PCA is centred, all measures for central tendency are moot. In fact, the mean and median are not interpretable here given that the original data is categorical. As such, the univariate statistics used here describe the spread and shape of the principal components, and are split into two groups:

• Central moments: variance, skewness and kurtosis.

• Empirical quantiles: interquartile range, lower decile and upper decile.

Figures 10 & 11 show the distributions of the six univariate statistics across all of the principal components in each case. In addition to this, they show a fitted Gaussian kernel density estimate [2] to accentuate the general shape of the histograms. What becomes immediately clear from each of these plots is that for datasets where Cao’s method succeeds, the general spread of their first principal component is much tighter than in the case where the proposed initialisation method succeeds. This is particularly evident in Figure 10a where relatively low variance in the first case indicates a higher level of density in the original categorical data.

The patterns in the quantiles further this. Although Figure 11a suggests that the components of Cao-preferable datasets can have higher interquartile ranges than in the second case, the lower and upper deciles tend to be closer together as is seen in Figures 11b & 11c. This suggests that despite the body of the component being spread, its extremities are not.

In Figures 10b & 10c, the most notable contrast between the two cases is the range in values for both skewness and kurtosis. This supports the evidence thus far that individual datasets have higher densities and lower variety (i.e. tighter extremities) when Cao’s method succeeds over the proposed initialisation. In particular, larger values of skewness and kurtosis translate to high similarity between the instances in a categorical dataset which is equivalent to having high density.

Overall, this analysis has revealed that if a dataset shows clear evidence of high-density points, then Cao’s method should be used over the proposed method. However, if there is no such evidence, the proposed method is able to find a substantially better clustering than Cao’s method.

4 Conclusion

In this paper a novel initialisation method for the k-modes was introduced that built on the method set out in the seminal paper [11]. The new method models the final ‘replacement’ process in the original as an instance of the Hospital-Resident Assignment Problem that may be solved to be mathematically fair and stable.

Following a thorough description of the k-modes algorithm and the established initialisation methods, a comparative analysis was conducted amongst the three initialisations using both benchmark and artificial datasets. This analysis revealed that the proposed initialisation was able to outperform both of the other methods when the choice of k was optimised according to a mathematically rigorous elbow method. However, the proposed method was unable to beat Cao’s method (established in [3]) when an external framework was imposed on each dataset by choosing k to be the number of classes present.

The proposed method should be employed over Cao’s when there are no hard restrictions on what k may be, or if there is no immediate evidence that the dataset at hand has some notion of high density. Otherwise, Cao’s method remains the most reliable initialisation in terms of computational time and final cost.

References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, 2007. ISBN 978-0-898716-24-5. URL http://dl.acm.org/citation.cfm?id= 1283383.1283494.

[2] D. M. Bashtannyk and R. J. Hyndman. Bandwidth selection for kernel conditional density estimation. Computational Statistics and Data Analysis, 36:279–298, 2001. ISSN 0167-9473.

[3] F. Cao, J. Liang, and L. Bai. A new initialization method for categorical data clustering. Expert Systems with Applications, 36:10223–10228, 2009. URL https://pdfs.semanticscholar. org/1955/c6801bca5e95a44e70ce14180f00fd3e55b8.pdf.

[4] F. Cao, J. Liang, D. Li, L. Bai, and C. Dang. A dissimilarity measure for the k-modes clustering algorithm. Knowledge-Based Systems, 26:120–127, 2012. doi: 10.1016/j.knosys.2011.07.011.

[5] D. Dua and C. Graff. UCI Machine Learning Repository, 2017. URL http://archive.ics. uci.edu/ml.

[6] A. Erdil and H. Ergin. Two-sided matching with indifferences. Journal of Economic Theory, 171:268–292, 2017. doi: 10.1016/j.jet.2017.07.002.

0.00

0.01

0.02

0.03

0

2

4

6

0.0

0.5

1.0

1.5

2.0

2.5

Figure 10: Distribution plots for the (a) variance, (b) skewness and (c) kurtosis of the first principal components in each case.

0.00

0.05

0.10

0.15

0.20

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.00

0.05

0.10

0.15

0.20

Figure 11: Distribution plots for the (a) interquartile range, (b) lower decile and (c) upper decile of the first principal components in each case.

[7] T. Fuku, A. Namatame, and T. Kaizoji. Collective Efficiency in Two-Sided Matching, pages 115–126. 2006. doi: 10.1007/3-540-28547-4 10.

[8] D. Gale and L. Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962. doi: 10.2307/2312726.

[9] Z. Huang. Clustering large data sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34, 1997.

[10] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 1–8, 1997.

[11] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1998. doi: 10.1023/A: 1009769707641.

[12] K. Iwama and S. Miyazaki. Stable Marriage with Ties and Incomplete Lists, pages 2071–2075. Springer New York, 2016. doi: 10.1007/978-1-4939-2864-4 805.

[13] F. Jiang, G. Liu, J. Du, and Y. Sui. Initialization of k-modes clustering using outlier detection techniques. Information Sciences, 332:167–183, 2016. doi: 10.1016/j.ins.2015.11.005.

[14] I. T. Jolliffe. Principal Component Analysis and Factor Analysis, pages 115–128. Springer New York, 1986. doi: 10.1007/978-1-4757-1904-8 7.

[15] A. Kwanashie, R. W. Irving, D. F. Manlove, and C. T. S. Sng. Profile-based optimal matchings in the student/project allocation problem. In Combinatorial Algorithms, pages 213–225, 2015. doi: 10.1007/978-3-319-19315-1 19.

[16] D. F. Manlove, R. W. Irving, K. Iwama, S. Miyazaki, and Y. Morita. Hard variants of stable

[17] F. M´emoli. Metric structures on datasets: Stability and classification of algorithms. In Computer Analysis of Images and Patterns, pages 1–33. Springer Berlin Heidelberg, 2011. doi: 10.1007/978-3-642-23678-5 1.

[18] M. K. Ng, M. J. Li, J. Z. Huang, and Z. He. On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (3):503–507, 2007. doi: 10.1109/TPAMI.2007.53.

[19] A. Olaode, G. Naghdy, and C. Todd. Unsupervised image classification by Probabilistic Latent Semantic Analysis for the annotation of images. In International Conference on Digital Image Computing: Techniques and Applications, 2014. doi: 10.13140/2.1.1909.4086.

[20] A. Roth. The evolution of the labor market for medical interns and residents: A case study in game theory. Journal of Political Economy, 92(6):991–1016, 1984. doi: 10.1086/261272.

[21] V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan. Finding a ‘kneedle’ in a haystack: Detecting knee points in system behavior. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops, pages 166–171, 07 2011. doi: 10.1109/ICDCSW.2011.20.

[22] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007. ISSN 1574-0137. doi: 10.1016/j.cosrev.2007.05.001.

[23] N. Sharma and N. Gaud. k-modes clustering algorithm for categorical data. International Journal of Computer Applications, 127(17):1–6, 2015. doi: 10.5120/ijca2015906708.

[24] The Matching library developers. Matching: v1.1, 2019. URL http://dx.doi.org/10.5281/

[25] H. Wilde, V. Knight, and J. Gillard. Evolutionary dataset optimisation: learning algorithm quality through evolution. Applied Intelligence, 2019. doi: 10.1007/s10489-019-01592-4.

designed for accessibility and to further open science