b

DiscoverSearch
About
Fair Correlation Clustering
2020·arXiv
ABSTRACT
ABSTRACT

1 In this paper, we study correlation clustering under fairness constraints. Fair variants of k-median and k-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints.

Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We define a fairlet decomposition with cost similar to the k-median cost and this allows us to obtain approximation algorithms for a wide range of fairness constraints.

We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.

There is a growing literature on fairness in various learning and optimization problems [34, 33, 31, 12, 46, 11, 17]. The goal of this literature is to develop criteria and algorithms to ensure that we can find solutions for optimization/learning problems that are fair with respect to a certain sensitive feature. In the case of clustering, a fundamental unsupervised learning and optimization problem, the study of fairness was initiated by Chierichetti et al. [16]. They formulated the notion of proportional fairness, and developed approximation algorithms for fair k-median and fair k-center under this notion of fairness. Follow-up work generalized their results to other clustering problems, such as k-means and facility location, and to more relaxed notions of fairness [8, 1, 7, 5, 36]. Notably, the important graph problem of correlation clustering has been so far not addressed by this literature.

Correlation clustering uses information about both similarity and dissimilarity relationships among a set of objects in order to cluster them [6]. In contrast to other clustering problems such as k-median, k-means, and k-center, the number of clusters is not pre-specified but rather determined based on the outcome of an optimization. This, as well as the fact that correlation clustering uses both similarity and dissimilarity information, makes it a desirable clustering model in many applications [38, 43, 4]. Therefore, it is natural to study this problem under fairness constraints.

The main tool introduced by Chierichetti et al. [16] for solving fair k-center and k-median problems is the notion of fairlets. A fairlet is a small set of elements that satisfies the fairness property. Chierichetti et al. [16] showed that fair k-median and k-center can be solved by first decomposing an instance into fairlets and then solving the clustering problem on the set of centers of these fairlets. To the best of our knowledge, this technique has been used only for metric space clustering problems such as k-center and k-median.

Our main result is developing a fairlet-based reduction for the graph clustering problem of correlation clustering. Whereas, in the case of k-center and k-median, the fairlet decomposition problem amounts to solving the same clustering problem on the same instance under the condition that each cluster is a fairlet, the situation for correlation clustering is complicated by the lack of the properties of metric spaces. To tackle this problem, we introduce a novel cost function for the correlation clustering fairlet decomposition, and prove that this cost can be approximated by a median-type clustering cost function for a carefully defined metric space.

Given a solution to this fairlet decomposition problem, we show that the fair correlation clustering instance can be reduced to a regular correlation clustering instance through a graph transformation. Therefore, any approximation algorithm for fairlet decomposition with median cost yields an approximation algorithm for fair correlation clustering; the loss in the approximation ratio depends on the size of the fairlets. We show that in many natural cases, there is a fairlet decomposition with small fairlets, thereby bounding the approximation ratio of our algorithm for fair correlation clustering.

In addition to the theoretical bounds on our algorithm, we provide an empirical evaluation based on real data sets, showing that the algorithms often perform much better in practice than their worst-case guarantees and that they yield solutions of costs comparable to that of unfair clustering algorithms while substantially reducing the cluster imbalance.

Related work. Clustering is a fundamental unsupervised machine learning task with a long history (cf. [30]). Our paper spans the areas of correlation clustering, clustering in metric spaces, and fairness in clustering which are actively growing fields. For brevity, we will only focus of key works in these three areas.

Correlation clustering. Correlation clustering is a widely studied formulation of clustering with both similarity and dissimilarity information [6], with many applications in machine learning [38, 9]. Variants of the problem include complete signed graphs [6, 4] and weighted graphs [19]. We focus on the complete graph case with  ±1weights which is APX-hard [13] but admits constant-factor algorithms [4, 14]. Distributed and streaming algorithms are also known [42, 3].

Metric space clustering. The most widely studied clustering setting is clustering in metric spaces consisting in minimizing the  ℓp-norm of the distances between points in a cluster and their center. For  p ∈ {1, 2, ∞}this corresponds to k-median, k-means, and k-center, respectively, which are NP-hard problems but admit constant-factor approximations [25, 28, 41, 2, 35].

Fairness in clustering. Fairness in machine learning is new area with a fast growing literature. Fundamental work in this area is devoted to defining notions of fairness [10, 20, 23, 33] and solving fairness-constrained problems [11, 12, 16, 31, 33, 46, 5, 34, 24, 1, 17, 27].

Chierichetti et al. [16] first introduced a notion of disparate impact for clustering and provided fair k-center algorithms for the case of two colors (or groups); see Section 2. Following this work, the problem has been later generalized in many directions including allowing many colors [44], allowing upper bounds on the fraction of points of a given color [1] and both upper and lower bounds [7, 8]. Backurs et al. [5] designed near-linear algorithms for finding k-median fairlets, and Huang et al. [29] designed core-sets for the problem. Other variants include clustering with diversity constraints [40], proportionality constraints [15], and fair center selection [36]. Fairness has been studied in spectral clustering as well [37, 47]. From an application point, fair clustering can be seen through the lenses of fair allocation [21], Medicaid eligibility [22], and ensuring protected group representations [18].

To the best of our knowledge no prior work has addressed correlation clustering with fairness constraints in the cluster elements distribution. Kalhan [32] recently studied a fairness notion in correlation clustering in which the maximum error for a vertex is bounded.

Correlation clustering. Let G = (V, E) be a complete undirected graph on |V | = n vertices and  σ : E �→ Rbe a function that assigns a label to each edge. The label  σ(e) for each eis either positive (indicating that the two endpoints of e are similar) or non-positive (indicating that they are dissimilar). In the unweighted version of the problem [6], σ(e) ∈ {−1, +1} for each e. Our focus in this paper is on the unweighted version, although we will use the weighted version in the proofs. Let  E+ = {e ∈ E | σ(e) > 0}be the set of positive edges and  E− = E \ E+be the set of non-positive edges. For subsets  S, T ⊆ V , let E(S) = E ∩ S2 denote the edges inside  S and E(S, T) = E ∩ (S × T)denote the edges between  S and T. Let E+(S, T) = E+ ∩ E(S, T) and E−(S, T) = E− ∩ E(S, T).

A clustering is a partitioning  C = {C1, C2, . . .}of V into disjoint subsets. The sets of intra-cluster and inter-cluster edges in a clustering C are defined as  intra(C) = �C∈C E(C) and inter(C) = E \ intra(C). Thecorrelation clustering cost of C is defined as:

image

In the unweighted version of the problem, this is simply COST(G, C) = |intra(C) ∩ E−| + |inter(C) ∩ E+|. The goalof correlation clustering2 is to find a clustering C to minimize COST(G, C). For unweighted correlation clustering, there are constant-factor approximation algorithms for this problem [6, 4, 14], with the best known constant being 2.06. For the weighted version, the best known algorithm obtains an O(log n)-approximation [19].

Fairness constraints. In the fair version of any clustering problem, each vertex  v ∈ Vhas a color c(v). Proportional fairness, defined by Chierichetti et al. [16], requires that in every cluster, the number of vertices of each color is proportional to the corresponding number in the whole graph. In particular, in the symmetric case where each color appears the same number of times in the graph, we require the same in each cluster. Ahmadian et al. [1] relaxed this property by requiring that each color constitutes at most an  α-fraction of each cluster, for a given  α ∈ (0, 1). Bera et al. [7] further generalized this notion to include lower bounds on the number of vertices of each color in each cluster.

We give a general reduction from fair correlation clustering to a median fairlet decomposition that works for any of these definitions of fairness, and in fact for a more general class of constraints. As long as the fairlet decomposition problem can be solved with small fairlets (holds for the above fairness definitions; see Section 4), this will give us an approximation algorithm for the corresponding fair correlation clustering problems.

In this section, we give a high-level overview of our algorithm and our main result. A key ingredient of our algorithm is a general reduction from the given constrained correlation clustering problem (as defined below) to a fairlet decomposition problem. We then show how the cost of a fairlet decomposition can be approximated by a median clustering cost function. This allows us to use previous results on the fair median problem to solve fairlet decomposition for the standard notions of fairness defined in the previous section. Finally, given an approximately optimal fairlet decomposition, we use our reduction to reduce the constrained correlation clustering instance to a standard correlation clustering instance, and apply known algorithms [6, 4, 14] to solve this problem.

Constrained correlation clustering. We start by defining a general class of constrained correlation clustering problems. Consider an unweighted correlation clustering instance G and let F be a family of subsets of V . We treat F as the family of feasible clusters, and assume it has the following composability property: for every  F1, F2 ∈ F, we have  F1 ∪ F2 ∈ F. Note this property is satisfied when F is the collection of all fair sets under any of the definitions of fairness given in Section 2. The constrained correlation clustering problem is to define a correlation clustering C with minimum COST(G, C) such that for all  C ∈ C, we have C ∈ F.

Fairlet decomposition. Next, we define the notion of fairlet decomposition used in our reduction. A fairlet decomposition for a constrained correlation clustering problem is simply a partition  P = {P1, P2, . . .}of V into subsets in F, i.e.,  Pi ∈ Ffor all i. We call each  Pia fairlet. The key in our reduction is a cost function FCOST that evaluates P’s usefulness in building a correlation clustering of G. Here we define this cost function, and in Section 4.1 we show how it can be approximated by the standard median clustering cost function in a carefully defined metric space.

Fairlet decomposition cost. Consider a fairlet decomposition  P = {P1, P2, . . .}. For each fairlet  Pi, we let FCOSTin(Pi)be the number of negative edges inside  Pi, i.e., FCOSTin(Pi) = |E− ∩ intra(Pi)|. For fairlets  Pi, Pj, welet let FCOSTout(Pi, Pj)be the number of edges between them with the minority sign, i.e.,

image

Finally, we let FCOSTin(P) = �i FCOSTin(Pi), FCOSTout(P) = �i<j FCOSTout(Pi, Pj), and FCOST(P) =FCOSTin(P) + FCOSTout(P).

Reduced instance. Given a constrained correlation clustering instance G and a fairlet decomposition P for G, we define a reduced correlation clustering instance as follows. Let  GP be a complete graph on  {p1, . . . , p|P|}, where vertex picorresponds to fairlet  Pi ∈ P. The label  σ(pi, pj)of the edge between  pi and pjis the majority sign of the edges in E(Pi, Pj)(with ties broken arbitrarily) multiplied by a weight that is equal to the number of edges in  E(Pi, Pj)with the majority sign.

Note that the instance  GP defined above is an instance of weighted correlation clustering, although as we will observe, the edges have weights that are within a constant factor each other, and therefore the problem can be solved using unweighted correlation clustering algorithms. Given a solution to this problem, it can be expanded into a solution of the original constrained problem. The final algorithm is sketched below.

image

To prove that the Algorithm 1 produces an approximately optimal solution to the constrained correlation clustering problem, we need the following lemmas. The first two lemmas prove that a solution of G can be transformed to a solution of  GPand vice versa, and these transformations do not increase the cost by more than the cost of the fairlet decomposition. The third lemma bounds the cost of a fairlet decomposition in terms of the cost of the optimal solution to the constrained correlation clustering problem. The proofs of these lemmas are presented in Appendix A. Lemma 3.1. Given a correlation clustering instance G, a fairlet decomposition P for G, and a clustering C of G, there exists a clustering  C′ of GP such that

COST(GP, C′) ≤ COST(G, C) + FCOSTout(P).Lemma 3.2. Let C be a clustering of  GP and C′ be the clustering computed in line 4 of Algorithm 1. Then,

COST(G, C′) ≤ COST(GP, C) + FCOST(P).Lemma 3.3. For any constrained correlation clustering instance G, and any constrained clustering C of G, there is a fairlet decomposition P of G satisfying FCOST(P) ≤ COST(G, C).

Putting these together, we have the following:

Theorem 3.4. Assume there is an  η-approximation algorithm  A1for finding the minimum cost fairlet decomposition P and a  β-approximation algorithm  A2for solving the unconstrained correlation clustering instance  GP. Then Algorithm 1 produces a  (β(1 + η) + η)-approximation for the constrained correlation clustering instance G.

Proof. Let OPT be an optimal solution to the constrained correlation clustering instance G. By Lemma 3.3, the fairlet decomposition problem has a solution of cost at most COST(G, OPT), and therefore, algorithm  A1for this problem must find a decomposition P with FCOST(P) ≤ η · COST(G, OPT). Also, by Lemma 3.1, the instance  GPhas a solution of cost at most  (1 + η) · COST(G, OPT). Therefore, algorithm  A2can find a clustering C of cost at most  β(1 + η) · COST(G, OPT). Thus, by Lemma 3.2, the cost of the clustering produced by Algorithm 1 is at most (β(1 + η) + η) · COST(G, OPT). Finally, by the composability property of the constraints, we know that this clustering satisfies the constraints, since each of its clusters is a union of fairlets in P.

In the following, we explain the approximation factor  βwe can get for solving unconstrained correlation clustering instance  GPand we dedicate the next section to approximation ratios  ηthat we can get for minimum cost fairlet decomposition problem depending on the fairness parameter  αand the number of colors in a given fair correlation clustering instance.

Lemma 3.5. There exists an approximation algorithm for unconstrained correlation clustering of  GP with approximation ratio of  β = min(log n, 2ρr2) where r = maxP ∈P |P |minP ∈P |P | and ρis the approximation factor of unweighted correlation clustering3.

Proof. Since the reduced correlation clustering instance is a weighted correlation clustering instance, there exists an O(log n)-approximation [19]. Now since the weight of the edge between  pi and pj in GP is at least  |Pi| · |Pj|/2 and atmost  |Pi| · |Pj|, any two edges weights are within  2r2of each other. So if we remove the weights from  GPand solve the resulting unweighted instance, we will get a  (2ρr2)-approximation.

In this section we show how to solve the fairlet decomposition problem by reducing it to a fair clustering problem with the median cost function in an appropriate metric space. The reduction (Section 4.1), loses a factor that is proportional to the size of the largest fairlet, but as we show in Section 4.2, in cases that we know how to solve the fairlet decomposition problem, the size of the fairlets can be guaranteed to be small.

4.1 Reduction to median cost

Consider a correlation clustering instance G and let d be a distance function defined on a metric space M containing the set of vertices V . For a fairlet decomposition  P = {P1, P2, . . .}, we define the following median cost: MCOST(Pi) = minu∈M�v∈Pi d(u, v) and MCOST(P) = �Pi∈P MCOST(Pi). Notice that the problem of finding the fairlet decomposition with minimum MCOST(P) is precisely the fairlet decomposition problem for fair k-median, as studied by [16, 7].

We now define a metric space (M, d) such that the median cost MCOST can approximate the fairlet cost FCOST. We first define an embedding  φ : V → [0, 1]n as follows. For a vertex  u ∈ V , let

image

In other words,  φ(v) is the vth row of the adjacency matrix of  G(V, E+)after adding a positive self-loop at every vertex. Now we let  M = [0, 1]n, and place the vertices in V in this space using the mapping  φ. We let d(·, ·)be the Hamming distance between the points in M. In other words, for vertices  u, v ∈ V , we have d(u, v) = |φ(u) − φ(v)|. Intuitively, d(u, v) measures the “cost” of committing to put u and v in one cluster in the correlation clustering instance.

We now prove the following two lemmas, which show that the FCOST of a fairlet decomposition is close to its MCOST with respect to the metric d. The proofs of these lemmas are in the Supplementary Material. Lemma 4.1. For any fairlet decomposition P, we have

image

Using the above lemmas, we have the following.

Theorem 4.3. Assume there is a  γ-approximation algorithm for fairlet decomposition with median costs. Furthermore, assume that this algorithm always produces fairlets of size at most f. Then the solution produced by this algorithm is a (4fγ)-approximation to the problem of finding a fairlet decomposition with minimum FCOST.

4.2 Algorithms for fairlet decomposition

In this section, we give algorithms for fairlet decomposition with median cost for several notions of fairness. Using Theorem 4.3, these algorithms imply algorithms for fairlet decomposition problem and provide the algorithm  A1in Theorems 3.4. We focus on three fairness constraints: an upper bound of  α = 12on the fraction of vertices of each color in each cluster; an upper bound of  α = 1/Cwhere C is the number of distinct colors (this corresponds to the proportional fairness property studied in [16]); and an upper bound of  α = 1/tfor an integer t on the fraction of vertices of each color in each cluster. We give approximation algorithms for the first two cases and a bicriteria approximation for the third case, with upper bounds of  3, C, and 2t − 1, respectively, on the size of fairlets.

Throughout this subsection, when we speak of the cost of a fairlet decomposition, we mean its median cost.

4.2.1 α = 1/2

This is probably the most common case of fair decomposition where clusters are required to not have a dominant color. In this case, we can show that a fairlets have size at most 3 and find these fairlets by solving a minimum weight 2-factor problem in a graph. Recall that a 2-factor is a subgraph where each vertex has degree 2 and edges may be used multiple times. This problem can be solved polynomially [45, Chapter 21]. Define a graph H on points in V as follows: two vertices u, v are connected by an edge if they have distinct colors; the weight of the edge is d(u, v). We first bound the cost of the optimal 2-factor and then explain our approximation factor in the lemma.

Lemma 4.4. The cost of an optimal 2-factor in H can be bounded by  2 · MCOST(P∗), where P∗ is the optimal fairlet decomposition.

Proof. We construct a feasible 2-factor by constructing a 2-factor for each fairlet  P ∈ P∗ with center  µ. Since there are at most |P|/2 vertices of any color, depending on the parity of P, vertices of P can be covered by matching and a possible multi-color triangle in H. Doubling the matching edges, we can get a 2-factor for covering  P∗. It remains to bound the cost of this 2-factor. For a matching edge  (u, v) for u, v ∈ P, by triangle inequality,  d(u, v) ≤ d(u, µi) + d(v, µi), andfor a triangle (u, v, w), the sum of pairwise distances can be bounded by  2(d(u, µ) + d(v, µ) + d(w, µ)). Hence the cost of the proposed 2-factor for covering  P∗ is at most 2 · MCOST(P∗)and the overall cost of the optimal 2-factor is at most  2 · MCOST(P∗).

Lemma 4.5. For  α = 1/2, there is an approximation algorithm for fairlet decomposition that returns a solution with

image

Proof. Consider an optimal 2-factor in H. Define a fairlet decomposition as follows. For each cycle of even length, consider a set of alternating edges and let each alternating edge be a fairlet with one of the endpoints chosen as the center. For a cycle of odd length, there must exists three consecutive vertices of pairwise distinct colors. In this case, let one fairlet be these three vertices with the middle vertex chosen as the center and for the (unique) alternating edges covering the remaining vertices, let each edge be a fairlet with one of the endpoints chosen as the center. The median cost of these fairlets is at most the weight of the original 2-factor, which is at most  2 · MCOST(P∗)by Lemma 4.4; the proof follows by construction.

Lemma 4.5 and Lemma 3.5 yield the following.

Theorem 4.6. For  α = 1/2, there is a 256-approximation algorithm for fair correlation clustering.

4.2.2 α = 1/C for C colors

In the case of  α = 1/C, each fairlet has equal number of points of each color and so these points can be matched together. We use this observation and devise an algorithm based on solving repeated matching problems in graph H (construction explained in Section 3).

Lemma 4.7. For  α = 1/C, there is an approximation algorithm for fairlet decomposition that returns a solution with

image

Proof. Consider an arbitrary ordering of the colors and solve a mincost matching problem between points of color c and c + 1 in the graph H. The union of these matchings yields a partition of V into paths of length C. Each such path is a fairlet; let P denote this fairlet decomposition. It remains to bound the cost of P.

Let us fix two colors c and c + 1 and let M be an arbitrary matching between vertices of color c and c + 1 such that point u is matched to a point v only if u and v belong to the same partition of  P∗. Since each part in  P∗has equal number of vertices of each color and there is an edge between any two vertices of different colors in H, matching M exists. Now since the cost of each matching edge (u, v) can be bounded by  d(u, µ) + d(v, µ) where µis the center of the partition containing u and v, the cost of M can be bounded by median cost of serving clients of colors c and c + 1. Since each color is matched twice, the total cost of each path corresponding to a partition is at most  2 · MCOST(P∗). Now for each path we pick the middle vertex as center and the cost of assigning vertices of the path to the center is at most C/2 cost of the path as each edge is charged at most C/2 times. Hence MCOST(P) ≤ C · MCOST(P∗).

So in this case, we get a 2-approximation with fairlets of size at most C. Note that in the motivating application of fair clustering, the color of each vertex corresponds to one possible value of a sensitive feature like race or gender, and therefore the value C tends to be small in such applications.

Lemma 4.7 and Lemma 3.5 yield the following.

Theorem 4.8. For  α = 1/C, there is a  (16.48C2)-approximation algorithm for fair correlation clustering.

4.2.3 α = 1/t

The fair decomposition with median cost is not well studied in the literature and in this paper, we were able to devise algorithms for the case of  α = 1/2and  α = 1/C. Next we consider the case where  1/α ∈ Z+and we argue how to utilize any approximation algorithm for fairlet decomposition as a black-box to build an algorithm for fair correlation clustering. While we allow the black-box algorithm to produce fairlets of arbitrary size, the following lemma ensures that the size of the fairlets can be bounded.

Lemma 4.9. For any set P that satisfies fairness constraint with  α = 1/t, there exists a partition of P into sets (P1, P2, . . .)where each  Pisatisfies the fairness constraint and  t ≤ |Pi| < 2t.

Proof. Let  p = m × t + r with 0 ≤ r < t. Then, the fairness constraints ensures that there are at most m elements of each color. Consider the partitioning obtained through the following process: consider an ordering of elements where

image

Table 1: ERROR and IMBALANCE in C = 2 color case for various datasets and different threshold  θfor the quantile used for positive edges. Notice how our algorithm MATCH + LOCAL has cost comparable to PIVOT and not much higher than LOCAL while reducing the imbalance from the up 65% of the unfair algorithms to 0.

points of the same color are in consecutive places, assign points to sets  P1, . . . , Pmin a round-robin fashion. So each set  Pigets at least t elements and at most t + r < 2t elements assigned to it. Since there are at most m elements of each color, each set gets at most one point of any color and hence all sets satisfy the fairness constraint as  1 ≤ 1t · |Pi|.

Theorem 4.10. For  α = 1/t, given an γ-approximation algorithm for fairlet decomposition with median cost, there is an  O(tγ)-approximation algorithm for fair correlation clustering.

In this section we present our experiments demonstrating that our algorithm solves the correlation clustering problem with fairness constraints with only a limited loss in the cost when compared to the vanilla (unfair) solution. We describe the datasets used, the algorithms evaluated, the quality measures, and our results.

Datasets. We use publicly-available datasets from the UCI Repository4 and from the SNAP Datasets5.

The datasets represent complete signed graphs from different domains, including both co-purchasing relationships among products, and semantic similarities among texts learned with embedding methods. The graphs are represented by complete signed matrices up to 1600x1600 in size, up to 0.9 million positive edges, and up to C = 16 colors. Here we only briefly describe the datasets; for details see Supplementary Material.

amazon: Vertices represents products on the Amazon website [39], the color is the item category, and two co-reviewed items have a +1 weight edges (all non-co-reviewed items have  −1weight edges). We use 1000 vertices equally distributed in two popular categories.

reuters and victorian: These are extracted from text data used in previous fair clustering work [1]. The datasets include between 50 and 100 English language texts from each of up to 16 authors.6 Each vertex represents a text and the color represents the author. For each text we obtain a semantic embedding vector with standard methods. We use a threshold on the dot product of the embedding vectors to obtain the edges. Through this operation, we set the top θ ∈ {0.25, 0.50, 0.75}fraction of edges via dot products as +1’s, and the remaining edges are labeled  −1.

Algorithms. We evaluate Algorithm 1 in two fairness scenarios: with an upper bound of  α = 1/2of the vertices for each color, and with  α = 1/Cfor equal color representation (for the two-color case, the two are equivalent). For the  α = 1/2case, in our experiments we simplify the algorithm of Section 4.2.2 to compute a minimum-cost perfect matching (1-factor) instead of a 2-factor decomposition. This can be formally shown to be sufficient for the C = 2 case, or when all optimal clusters are even sized, and we observe it works well in practice in our experiments. For  α = 1/C,we implement the repeated matching algorithm in 4.2.2 to obtain the fairlets in a similar fashion. After finding the fairlets, we use an in-house correlation clustering solver based on local search (LOCAL). We refer to our algorithms as MATCH + LOCAL for the  α = 1/2case and as REP. MATCH + LOCAL for the  α = 1/C case.

image

Table 2: Experimental results for  victorian, θ = 0.50, using C = 8 colors.

We also consider the following (unfair) baseline algorithms: the standard PIVOT algorithm of Ailon et al. [4] for correlation clustering (for PIVOT, we repeat the randomized algorithm 10 times and use the best result), and the (unfair) local search heuristic LOCAL used as part of our algorithm. In addition, as no prior work has addressed the fair correlation clustering problem, we compare our algorithm with two simple fair baselines: the whole graph as one cluster SINGLE, and a random fairlet decomposition RAND.

Quality measures. For each algorithm, we report the following measures. The ERROR of the correlation cost of the clustering obtained by the algorithm, presented as the ratio of the edges of the graph that are in disagreement with the clustering (i.e., inter-cluster positive edges and intra-cluster negative edges) over the number of edges (i.e., 0 corresponds to a perfect solution, whereas 1 corresponds to a completely incorrect solution). For fairness, we report the IMBALANCE as the total fraction of vertices that violate the  αcolor representation constraint, i.e., vertices for each cluster and color that are above the  αfraction for the size of the cluster [1]. More precisely, let P be a cluster in the solution of an  α-color constraint instance. The maximum allowed number of points of a certain color in the cluster P is  ⌊|P|α⌋. Let  Vcbe the vertices of color c on the graph and  ∆P = �P,c max(|P ∩ Vc| − ⌊|P|α⌋, 0)be the vertices violating the constraint in P. We report �P ∆P /|V | as the IMBALANCE for α ∈ {1/2, 1/C}(i.e., 0 corresponds to no imbalance, and 1 corresponds to complete imbalance). We repeat all algorithms 10 times and report the mean results for all measures.

5.1 Results for C = 2

Table 1 shows the results for the C = 2 color case with  α = 1/2, i.e., equal representation over the various datasets. We use the MATCH + LOCAL as a fair algorithm, which has an IMBALANCE of 0 by construction.

The table shows clearly that our algorithm MATCH + LOCAL obtains clusters that are fair and has costs comparable to the unfair PIVOT baseline (and sometimes better) and slightly worse than the LOCAL baseline. On average, over all datasets, our algorithm has an average cost of  0.234 vs 0.193 for PIVOT and 0.139 for LOCAL. On the other hand, our algorithm is significantly better than both SINGLE and RAND, which, while satisfying fairness, have exorbitantly high costs.

Notice how the LOCAL and the PIVOT baselines have very high IMBALANCE values of up to 65% of the vertices (as they are oblivious to colors), showing the importance of developing novel algorithms for the problem. This result is not surprising. If pairs of vertices of the same color are more likely to be similar, it is expected that many clusters will contain vast majorities of points with a single color.

5.2 Results for C > 2

Here, we study the behavior of our algorithm in the case when more than two colors are present in the dataset. We use MATCH + LOCAL to obtain a  α = 1/2fair solution and REP. MATCH + LOCAL to obtain a  α = 1/C(equal representation) solution.

We report an overview of our experimental results in Table 2 for the dataset victorian with threshold  θ = 0.50and C = 8 colors (note that this is a different graph than that produced with the previous C = 2 dataset). More experimental results are available in Supplementary Material.

It is easy to see that all the earlier trends continue to hold. Notice how the algorithm for the  1/2 case MATCH + LOCALis only marginally worse than the best unfair solution LOCAL and much better than all other baselines. Our algorithm for the more difficult equal representation case is again better than the PIVOT baseline. Notice how all unfair algorithms are significantly far from being equally balanced. However, the presence of many colors makes it easier to get closer to the 1/2 threshold.

image

Figure 1: ERROR of our algorithms over that of the unfair LOCAL algorithms for  α = 1/2 and α = 1/C, on a series of graphs from  victorian, θ = 0.50, and using C = 2 to C = 16 colors. .

We confirm this observation in Figure 1. The figure shows a comparison of the ERROR for our algorithms for  α = 1/2and  α = 1/Cvs the ERROR for the vanilla LOCAL algorithms as the number of colors goes from C = 2 to C = 16. We report the ratio of MATCH + LOCAL over LOCAL as a solid line, and the ratio for REP. MATCH + LOCAL as a dashed line. Notice how the error for our algorithm for the  α = 1/2case gets closer and closer to the LOCAL output error for more colors. Again, this result is obtained because the presence of many colors makes the problem easier for the  α = 1/2case. The performance of the algorithm for the  α = 1/Ccase has a less stable pattern, but it confirms that the algorithm is quite competitive with the unfair solution (between  30 − 80%higher error) even for quite a few colors. These results are significantly better than what is expected from a worst-case analysis.

Finally, we report having observed that using just MATCH or REP. MATCH fairlets without the re-clustering part of the algorithm is not sufficient for obtaining good results, as the re-clustering step is needed for obtaining clusters that do not have large errors.

In this paper we initiated the study of correlation clustering with fairness constraints. We showed a reduction to the fairlet decomposition problem with a standard median cost function, for a carefully chosen distance function. Using this, and old and new results on the fairlet decomposition problem with a median cost function, we obtained provable constant-factor approximation algorithms for fair correlation clustering for various notions of fairness. Our experimental evaluation shows that these algorithms perform well not only in theory but also in practice.

[1] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Clustering without over-representation. In KDD, pages 267–275, 2019.

[2] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In FOCS, pages 61–72, 2017.

[3] K. Ahn, G. Cormode, S. Guha, A. McGregor, and A. Wirth. Correlation clustering in data streams. In ICML, pages 2237–2246, 2015.

[4] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1–23:27, 2008.

[5] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner. Scalable fair clustering. In ICML, pages 405–413, 2019.

[6] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Mach. Learn., 56(1-3):89–113, 2004.

[7] S. K. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In NeurIPS, pages 4955–4966, 2019.

[8] I. O. Bercea, M. Gross, S. Khuller, A. Kumar, C. Rösner, D. R. Schmidt, and M. Schmidt. On the cost of essentially fair clusterings. In APPROX-RANDOM, pages 18:1–18:22, 2019.

[9] M. Bressan, N. Cesa-Bianchi, A. Paudice, and F. Vitale. Correlation clustering with adaptive similarity queries. In NeurIPS, pages 12510–12519, 2019.

[10] T. Calders and S. Verwer. Three naive bayes approaches for discrimination-free classification. DMKD, 21(2):277– 292, 2010.

[11] L. E. Celis, L. Huang, and N. K. Vishnoi. Multiwinner voting with fairness constraints. In IJCAI, pages 144–151, 2018.

[12] L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. In ICALP, pages 28:1–28:15, 2018.

[13] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. JCSS, 71(3):360–383, 2005.

[14] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In STOC, pages 291–228, 2015.

[15] X. Chen, B. Fain, C. Lyu, and K. Munagala. Proportionally fair clustering. In ICML, pages 1032–1041, 2019.

[16] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017.

[17] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Matroids, matchings, and fairness. In AISTATS, pages 2212–2220, 2019.

[18] A. Dash, A. Shandilya, A. Biswas, K. Ghosh, S. Ghosh, and A. Chakraborty. Summarizing user-generated textual content: Motivation and methods for fairness in algorithmic summaries. In CSCW, pages 1–28, 2019.

[19] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. TCS, 361(2-3):172–187, 2006.

[20] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In ITCS, pages 214–226, 2012.

[21] H. Elzayn, S. Jabbari, C. Jung, M. Kearns, S. Neel, A. Roth, and Z. Schutzman. Fair algorithms for learning in allocation problems. In FAT, pages 170–179, 2019.

[22] B. Fang, M. Jiang, and J. Shen. Achieving fairness in determining medicaid eligibility through fairgroup construction. arXiv:1906.00128, 2019.

[23] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In KDD, pages 259–268, 2015.

[24] B. Fish, J. Kun, and Á. D. Lelkes. A confidence-based approach for balancing fairness and accuracy. In SDM, pages 144–152, 2016.

[25] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. TCS, 38:293–306, 1985.

[26] A. Gungor. Fifty victorian era novelists authorship attribution data. 2018.

[27] S. Har-Peled and S. Mahabadi. Near neighbor: Who is the fairest of them all? arXiv:1906.02640, 2019.

[28] D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. MOR, 10(2):180–184, 1985.

[29] L. Huang, S. H. C. Jiang, and N. K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019.

[30] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[31] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In NIPS, pages 325–333, 2016.

[32] S. Kalhan, K. Makarychev, and T. Zhou. Improved algorithms for correlation clustering with local objectives. In NeurIPS, pages 9341–9350, 2019.

[33] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma. Fairness-aware classifier with prejudice remover regularizer. In PKDD, pages 35–50, 2012.

[34] T. Kamishima, S. Akaho, and J. Sakuma. Fairness-aware learning through regularization approach. In ICDMW, pages 643–650, 2011.

[35] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89–112, 2004.

[36] M. Kleindessner, P. Awasthi, and J. Morgenstern. Fair k-center clustering for data summarization. In ICML, pages 3458–3467, 2019.

[37] M. Kleindessner, S. Samadi, P. Awasthi, and J. Morgenstern. Guarantees for spectral clustering with fairness constraints. In ICML, pages 3448–3457, 2019.

[38] S. Kushagra, S. Ben-David, and I. Ilyas. Semi-supervised clustering for de-duplication. arXiv:1810.04361, 2018.

[39] J. Leskovec, L. A. Adamic, and B. A. Huberman. The dynamics of viral marketing. TWEB, 1(1):5, 2007.

[40] J. Li, K. Yi, and Q. Zhang. Clustering with diversity. In ICALP, pages 188–200, 2010.

[41] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. SICOMP, 45(2):530–547, 2016.

[42] X. Pan, D. Papailiopoulos, S. Oymak, B. Recht, K. Ramachandran, and M. I. Jordan. Parallel correlation clustering on big graphs. In NIPS, pages 82–90, 2015.

[43] J. Pouget-Abadie, K. Aydin, W. Schudy, K. Brodersen, and V. Mirrokni. Variance reduction in bipartite experiments through correlation clustering. In NeurIPS, pages 13288–13298, 2019.

[44] C. Rösner and M. Schmidt. Privacy preserving clustering with constraints. In ICALP, pages 96:1–96:14, 2018.

[45] A. Schrijver. Combinatorial Optimization: Polyhedra and Efficiency, volume 24. Springer Science & Business Media, 2003.

[46] K. Yang and J. Stoyanovich. Measuring fairness in ranked outputs. In SSDBM, pages 22:1–22:6, 2017.

[47] I. M. Ziko, E. Granger, J. Yuan, and I. B. Ayed. Clustering with fairness constraints: A flexible and scalable approach. arXiv:1906.08207, 2019.

Lemma 3.1. Given a correlation clustering instance G, a fairlet decomposition P for G, and a clustering C of G, there exists a clustering  C′ of GP such that

image

Proof. To show existence of the claimed clustering  C′, we devise a randomized algorithm and bound the expected cost of the clustering output of this algorithm. We abuse notation and let clustering  C′be the output of this randomized algorithm on  GP. Given fairlet decomposition P, the algorithm first picks a representative  rifor each partition  Pi in Puniformly at random. Next the algorithm defines clustering  C′based on where  riis assigned in C: if  riis placed in cluster  Cj, it places the vertex  pi of GP in the cluster  C′j.

Next, we bound the expected cost of  C′in  GP. Let us fix two vertices  piand  pjin  GP. We first consider the case σ(pi, pj) > 0; the other case follows by a similar argument. The clustering  C′incurs a cost of  |σ(pi, pj)|if  piand  pjare assigned to different clusters, which happens when the two representatives  riand  rjare in different clusters in C. Now, if  σ(ri, rj) = +1, then C also pays a cost of 1 for separating  riand  rjand if  σ(ri, rj) = −1, then the edge (ri, rj)is an edge with the minority sign and it contributes to FCOSTout(Pi, Pj). Hence, if we denote the cost of the edge between  pi and pjin the clustering  C′ by COSTC′(pipj), we can bound the expected value of this cost as follows:

image

where 1 (A) is an indicator function having value 1 if A is true and zero otherwise. Now since  riand  rjare picked uniformly at random,

image

This follows from the fact that there are  |Pi| · |Pj|many possible pairs  (ri, rj)to be selected. Summing the above over all  pi, pj and using |σ(pi, pj)| ≤ |Pi| · |Pj|, we get

image

Lemma 3.2. Assume C is a clustering of  GP, and let  C′be the clustering computed in line 4 of Algorithm 1 for G. Then we have

image

Proof. Any edge (u, v) contributing to the cost of the clustering  C′ is either a negative edge inside a fairlet or an edge between two fairlets that are clustered in disagreement with  σ(u, v) in C. Negative edges inside fairlets are counted in FCOSTin(P). An edge (u, v) between fairlets  Piand  Pjthat is clustered in disagreement with  σ(u, v)either has the same sign as the majority sign of  E(Pi, Pj), or as the minority sign of  E(Pi, Pj). The edges in the former case are counted in COST(GP, C), and the edges in the latter case are counted in FCOSTout(P). Therefore, the total cost of  C′ isat most COST(GP, C) + FCOSTin(P) + FCOSTout(P).

Lemma 3.3. For any constrained correlation clustering instance G, and any constrained clustering C of G, there is a fairlet decomposition P of G satisfying FCOST(P) ≤ COST(G, C).

Proof. We can simply take P to be the same as the clustering C. It is easy to observe that this is a valid fairlet decomposition. To bound FCOST(P), it is enough to note that each edge counted in FCOSTin(P)also imposes a cost of 1 in C (as it is a negative edge inside a cluster), and for any two clusters  Ciand  Cj, the number of positive edges between  Ciand  Cjis at least FCOSTout(Ci, Cj). Summing over all these inequalities, we obtain that FCOST(P) is at most COST(G, C).

Lemma 4.1. For any fairlet decomposition P, we have

image

Proof. Consider a fairlet  Piin P. We define a vector  µ ∈ [0, 1]nindexed by the vertices of G as follows:  µu =majority({φ(v)u : v ∈ Pi}). By the definition of MCOST, we have

image

On the other hand, for every vertex u, if we denote  N −(u) = {v : (u, v) ∈ E−} and N +(u) = {u}∪{v : (u, v) ∈ E+},we have

image

For  u ̸∈ Pi, the above quantity is precisely FCOSTout(Pi, {u}). For u ∈ Pi, the above quantity is at most  |N −(u) ∩ Pi|,which is the number of negative edges in  Piincident on u. Therefore, the sum of this quantity over all u can be bounded by

image

Finally, by the definition of FCOSTout, we have FCOSTout(Pi, S) + FCOSTout(Pi, T) ≤ FCOSTout(Pi, S ∪ T) for anytwo disjoint sets S and T. Therefore,

image

Combining this with Equations (1) and (2), we get MCOST(P) ≤ 2· FCOSTin(P)+ FCOSTout(P) ≤ 2· FCOST(P).

Lemma 4.2. Let P be any fairlet decomposition and let  f = maxP ∈P |P|. Then,

image

Proof. Consider a fairlet  Piand define the vector  µas in the proof of Lemma 4.1. It is easy to see that

image

As in the proof of Lemma 4.1, for every  u ̸∈ Pi, the summand in the above expression is precisely FCOSTout(Pi, {u}).For  u ∈ Pi, this quantity is zero if u has no negative edge to any other vertex in  Pi, and is at least 1 otherwise. Therefore, since  |Pi| ≤ f, this quantity is always at least the number of negative edges from u to other vertices in  Pidivided by f. Therefore,

image

Summing over all i and rearranging the terms, we obtain:

image

Now, we fix i < j and bound the summand in the above expression. Without loss of generality, we assume  |Pi| ≥ |Pj|.We consider the following cases:

Case 1: There are at most  |Pi|/2vertices u in  Piwith FCOSTout(Pj, {u}) = 0. In this case, we have

image

Case 2: There are at least  |Pi|/2vertices u in  Piwith FCOSTout(Pj, {u}) = 0. Let S be the set of such vertices. By the definition of FCOSTout(Pj, {u}), any  u ∈ Smust have either positive edges to all vertices in  Pj, or negative edges to all of them. Assume x vertices in S have positive edges to all vertices in  Pjand y of them have negative edges, for some  x, y with x + y = |S| ≥ |Pi|/2. We further consider the following cases:

Case 2a: If x = 0, then every vertex  u in Pjhas at least  Pi/2positive edges to vertices in  Pi(namely, at least to those in S). Therefore, FCOSTout(Pi, {u}) = |E−∩E(Pi, {u})|. Thus, �u∈Pj FCOSTout(Pi, {u}) =|E− ∩ E(Pi, Pj)| ≥ FCOSTout(Pi, Pj).

Case 2b: If y = 0, an argument similar to case 2a shows that �u∈Pj FCOSTout(Pi, {u}) = |E+ ∩E(Pi, Pj)| ≥ FCOSTout(Pi, Pj).

Case 2c: If  x ≥ 1and  y ≥ 1, then each vertex in  Pjhas at least one positive edge and at least one negative edge to  Pi. Therefore, for every  u ∈ Pj, FCOSTout(Pi, {u}) ≥ 1. Thus, �u∈Pj FCOSTout(Pi, {u}) ≥

image

This, together with (3), implies:

image

Theorem 4.6. For  α = 1/2, there is a 256-approximation algorithm for fair correlation clustering.

Proof. From Theorem 4.3 and Lemma 4.5, we get a 24-approximation algorithm for solving fairlet decomposition with minimum FCOST (f = 3, γ = 2). From Lemma 3.5, there is a  2 · 2.06 · (1.5)2 = 9.27-approximation algorithm for unconstrained correlation clustering. Combining, we get a 255.75-approximation algorithm for fair correlation clustering.

Theorem 4.8. For  α = 1/C, there is a  (16.48C2)-approximation algorithm for fair correlation clustering.

image

Table 3: Experimental results for amazon, C = 4 colors.

Proof. From Theorem 4.3 and Lemma 4.5, we get a  4C2-approximation algorithm for solving fairlet decomposition with minimum FCOST (f = C, γ = C). From Lemma 3.5, there is a  2 · 2.06 · 1 = 4.12-approximation algorithm for unconstrained correlation clustering. Combining, we get a  (16.48C2)-approximation algorithm for fair correlation clustering.

Theorem 4.10. For  α = 1/t, given an  γ-approximation for fair decomposition with median cost, there exists an O(tγ)-approximation algorithm for fairlet correlation clustering.

Proof. Let P be the output of the output of the  γ-approximation algorithm on the metric space (M, d) obtained from correlation instance G. Let fairlet decomposition  P′be obtained from P by applying Lemma 4.9 and assigning each fairlet to a center minimizing the median cost of the fairlet. Since dedicating a center to a subset of points assigned to the same center in P can only decrease the median cost, MCOST(P′) ≤ MCOST(P). From Theorem 4.3 and Lemma 4.5, there is a  ((8t − 4)γ)-approximation algorithm for solving fairlet decomposition with minimum FCOST (f = 2t−1, γ = γA). Since the size of each fairlet is at least t, applying Lemma 3.5, there is a  2·2.06·( 2t−1t )2 < 16.48-approximation algorithm for solving unconstrained correlation clustering. Now applying Theorem 3.4, we get an O(tγ)-approximation algorithm for fair correlation clustering.

Here, we report additional experimental results.

C.1 Description of the datasets

We describe more in detail the datasets used.

amazon: Vertices represents products on the Amazon website [39] and positive edges connect products co-reviewed by the same user (all missing edges are treated as negative). We set the color of each item to its category. Further, we use 1000 vertices equally distributed among 2 popular book categories Nonfiction and Literature & Fiction for a total of ∼ 106,000positive edges.

reuters: This graph is extracted from a dataset, which was used in previous fair clustering work [1] and includes 50 English language articles from each of up to 16 authors (for a total of up to 800 texts).This dataset is available at archive.ics.uci.edu/ml/datasets/Reuter_50_50. We transform each text into a 10-dimensional vector using Gensim’s Doc2Vec with standard parameters, as in previous work [1], and we create one vertex for each text. Then we use a threshold on the dot product of the embedding vectors. Through this operation, we set the top θ ∈ {0.25, 0.50, 0.75}fraction of edges via dot products as +1’s ,and the remaining edges are assigned  −1’s. Note that the colors represent the text authors.

victorian: Similarly, for the victorian dataset, available at archive.ics.uci.edu/ml/datasets/Victorian+Era+ Authorship+Attribution. We use texts from up to 16 English-language authors from the Victorian era. Each text consists of 1,000-word sequences obtained from a book written by one of these authors (we use the training dataset). The data was extracted and processed in [26]. From each document, we extract a 10-dimensional vector using Gensim’s Doc2Vec with the standard parameter settings again, and we assign the author id as color, as in prior work [1]. We use 100 texts from each author, create one vertex for each text, and set the top  θ ∈ {0.25, 0.50, 0.75}fraction of pairwise dot product edges as positive, and the remaining edges as negative. All graphs are unweighted and complete.

image

Table 4: Experimental results for  reuters, θ = 0.50, C = 8 colors.

image

Table 5: Experimental results for various datasets and number of colors.

C.2 Other experimental results

In Table 3 we report an overview of the results of the various algorithms for a dataset extracted from Amazon involving 250 vertex for each of 4 colors corresponding to the book categories Literature & Fiction, Nonfiction, Business & Investing, Computers & Internet.

Similarly in Table 4 we report the results for  reuters, θ = 0.50, c = 8 colors.

Finally, in Table 5 we report an evaluation of our algorithms in a variety of datasets and for different number of colors.

Notice how in all cases the results matches qualitatively the results reported in the main paper.

Additional baselines. We further experimented with two other greedy baselines. First, we tried the following (unfair) greedy baseline: in an arbitrary order, iterate over the vertices, and for each vertex, add it to either the current cluster with most positive neighbors (if it exists) or to a singleton cluster. More precisely, we assign the vertex to the best current cluster, if it is connected with more positive edges than negative edges to it, otherwise we leave the vertex as a singleton. Unsurprisingly, this unfair baseline is worse than all other unfair baselines we considered in terms of error and it has also a large imbalance, so we omit the results.

We also tested a fair greedy baseline for  α = 12, for C = 2: sort all pairs of different color vertices by distance in the Hamming space in an increasing order, and assign vertices to clusters of size 2 with a greedy matching algorithm over this order. This creates fair clusters but again, we observe that this baseline to be close to that of RAND and as such we omit the results.

Running time. All experiments have been conducted on commodity hardware. Each run of an algorithm completed in less than an hour. In our experiments, our fair algorithms have a running time in the same order of magnitude of that of the local search heuristic. For instance, for  reuters, θ = 0.50, the ratio of mean running time of MATCH +LOCAL and REP. MATCH + LOCAL w.r.t. LOCAL was 90% and 29%, respectively. For  victorian, θ = 0.50it was 123% and 41%, respectively.

image

Figure 2: For partition  P = (P1, P2, P3), graphs G and GP are demonstrated; negative edges are red and positive edges are blue. In this example, FCOSTin(P1) = FCOSTin(P3) = 1, FCOSTin(P2) = 0 and FCOSTout(Pi, Pj)are shown as a weight of edge  (Pi, Pj) in GP.


Designed for Accessibility and to further Open Science