Near Neighbor: Who is the Fairest of Them All?

2019·arXiv

Abstract

1. Introduction

Nowadays, many important decisions, such as college admissions, offering home loans, or estimating the likelihood of recidivism, rely on machine learning algorithms. There is a growing concern about the fairness of the algorithms and creating bias toward a specific population or feature [HPS16, Cho17, MSP16, KLL]. While algorithms are not inherently biased, nevertheless, they may amplify the already existing biases in the data. Hence, this concern has led to the design of fair algorithms for many different applications, e.g., [DOBDABD

Bias in the data used for training machine learning algorithms is a monumental challenge in creating fair algorithms [HGB]. Here, we are interested in a somewhat different problem, of handling the bias introduced by the data-structures used by such algorithms. Specifically, data-structures may introduce bias in the data stored in them, and the way they answer queries, because of the way the data is stored and how it is being accessed. Such a defect leads to selection bias by the algorithms using such data-structures. It is natural to want data-structures that do not introduce a selection bias into the data when handling queries.

The target as such is to derive data-structures that are bias-neutral. To this end, imagine a data-structure that can return, as an answer to a query, an item out of a set of acceptable answers. The purpose is then to return uniformly a random item out of the set of acceptable outcomes, without explicitly computing the whole set of acceptable answers (which might be prohibitively expensive).

Several notions of fairness have been studied, including (where demographics of the population is preserved in the outcome) and individual fairness (where the goal is to treat individuals with similar conditions similarly) [DHP]. In this work, we study the near neighbor problem from the perspective of individual fairness.

Near Neighbor is a fundamental problem that has applications in many areas such as machine learning, databases, computer vision, information retrieval, and many others, see [SDI06, AI08] for an overview. The problem is formally defined as follows. Let pM, dq be a metric space. Given a set points and a parameter r, the goal of the near neighbor problem is to preprocess P, such that for a query point q P M, one can report a point p P P, such that dif such a point exists. As all the existing algorithms for the exact variant of the problem have either space or query time that depends exponentially on the ambient dimension of M, people have considered the approximate variant of the problem. In the c-approximate near neighbor (ANN) problem, the algorithm is allowed to report a point p whose distance to the query is at most cr if a point within distance r of the query exists, for some prespecified constant

Perhaps the most prominent approach to get an ANN data structure is via Locality Sensitive Hashing (LSH) [IM98, HIM12], which leads to sub-linear query time and subquadratic space. In particular, for one can get a query time of where for the distance metric the distance metric ]. The idea of the LSH method is to hash all the points using several hash functions that are chosen randomly, with the property that closer points have a higher probability of collision than the far points. Therefore, the closer points to a query have a higher probability of falling into a bucket being probed than far points. Thus, reporting a random point from a random bucket computed for the query, produces a distribution that is biased by the distance to the query: closer points to the query have a higher probability of being chosen.

When random nearby is better than nearest. The bias mentioned above towards nearer points is usually a good property, but is not always desirable. Indeed, consider the following scenarios:

(I) The nearest neighbor might not be the best if the input is noisy, and the closest point might be viewed as an unrepresentative outlier. Any point in the neighborhood might be then considered to be equivalently beneficial. This is to some extent why k-NN classification [ELL09] is so effective in reducing the effect of noise.

(II) However, k-NN works better in many cases if k is large, but computing the k nearestneighbors is quite expensive if k is large [HAAA14]. Computing quickly a random nearby neighbor can significantly speed-up such classification.

(III) We are interested in annonymizing the query [Ada07], thus returning a random near-neighbor might serve as first line of defense in trying to make it harder to recover the query. Similarly, one might want to anonymize the nearest-neighbor [QA08], for applications were we are interested in a “typical” data item close to the query, without identifying the nearest item.

(IV) If one wants to estimate the number of items with a desired property within the neighborhood, then the easiest way to do it is via uniform random sampling from the neighborhood. In particular, this is useful for density estimation [KLK12].

(V) Another natural application is simulating a random walk in the graph where two items are connected if they are in distance at most r from each other. Such random walks are used by some graph clustering algorithms [HK01].

1.1. Results

Our goal is to solve the near-neighbor problem, and yet be fair among “all the points” in the neighborhood. We introduce and study the fair near neighbor problem – where the goal is to report any point of Npq, rq with uniform distribution. That is, report a point within distance r of the query point with probability of Naturally, we study the approximate fair near neighbor problem, where one can hope to get efficient data-structures. We have the following results:

(II) Approximate neighborhood. This formulation reports an almost uniform distribution from an approximate neighborhood S of the query. We can provide such a data structure that uses space Spn, cq and whose query time is rOpQpn, cqq, albeit in expectation. See Lemma 4.3 for the exact statement.

Moreover, the algorithm produces the samples independently of past queries. In particular, one can assume that an adversary is producing the set of queries and has full knowledge of the data structure. Even then the generated samples have the same (almost) uniform guarantees. Furthermore, we remark that the new sampling strategy can be embedded in the existing LSH method to achieve unbiased query results. Finally, we remark that to get a distribution that is -uniform (See preliminaries for the definition), the dependence of our algorithms on

Very recently, independent of our work, [APS19] also provides a similar definition for the fair near neighbor problem.

Experiments. Finally, we compare the performance of our algorithm with the algorithm that uniformly picks a bucket and reports a random point, on the MNIST, SIFT10K, and GloVe data sets. Our empirical results show that while the standard LSH algorithm fails to fairly sample a point in the neighborhood of the query, our algorithm produces an empirical distribution which is much closer to the uniform distribution: it improves the statistical distance to the uniform distribution by a significant factor.

1.2. Data-structure: Sampling from a sub-collection of sets

We first study the more generic problem – given a collection F of sets from a universe of n elements, a query is a sub-collection of these sets and the goal is to sample (almost) uniformly from the union of the sets in this sub-collection. We do this by first sampling a set X in the sub-collection G proportional to the size of the set X, and then sampling an element x P X uniformly at random. This produces a distribution on elements in the sub-collection such that any element x is chosen proportional to its degree dpxq (i.e., the number of sets X P G that x P X). Therefore, we can use rejection sampling and only report x with probability 1{dpxq.

We can compute the degree by checking if x P X for all sets X P G in the collection, which takes time proportional to . Also because of the rejection sampling, we might need to repeat this process times which can be as large as g. This leads to expected runtime to generate a single sample, see Lemma 3.1.

As a first improvement, we approximate the degree using standard sampling techniques, which can be done in time rOpg{dpxqq. Although this can still be large for small degrees, however, those small values will also be rejected with a smaller probability. Using this, we can bound the runtime of a query by (the sampling is -uniform), where n is (roughly) the input size.

Our second improvement, which the authors believe to be quite interesting, follows by simulating this rejection sampling directly. This follows by first introducing a heuristic to approximate the degree, and then shortcutting it to get the desired simulation. Section 3.4 describes this modified algorithm. In particular, one can get uniform sampling with high probability. Specifically, one can sample uniformly in expected time, where the sampling succeeds with probability . Alternatively, one can sample -uniformly, with the expected running time being . This is a significant improvement, in both simplicity and dependency on , over the previous scheme.

We also show how to modify that data-structure to handle outliers, as it is the case for LSH, as the sampling algorithm needs to ignore such points once they are reported as a sample.

Applications. Here are a few examples of applications of such a data-structure (for sampling

from a union of sets):

(A) Given a subset X of vertices in the graph, randomly pick (with uniform distribution) a neighbor to one of the vertices of X. This can be used in simulating disease spread [KE05].

(B) Here, we use a variant of this data-structure to implement the fair ANN. (C) Uniform sampling for range searching [HQT14, AW17, AP19]. Indeed, consider a set of points, stored in a data-structure for range queries. Using the above, we can support sampling from the points reported by several queries, even if the reported answers are not disjoint. Being unaware of any previous work on this problem, we believe this data-structure is of independent interest.

1.3. Paper organization

We describe some basic sampling and approximation tools in Section 2. We describe the sampling from union of set data-structure in Section 3. The application of the data-structure to LSH is described in Section 4. The experiments are described in Section 5.

2. Preliminaries

Neighborhood, fair nearest-neighbor, and approximate neighborhood. Let pM, dq be a metric space and let be a set of n points. Let be the (close) ball of radius r around a point r-neighborhood of c in P. The size of the r-neighborhood is

Definition 2.1 (FANN). Given a data set points and a parameter r, the goal is to preprocess P such that for a given query q, one reports each point p P Npq, rq with probability is an approximately uniform probability distribution:

Definition 2.2 (FANN with approximate neighborhood). Given a data set and a parameter r, the goal is to preprocess them such that for a given query q, one reports each point p P S with probability is a point set such that

Set representation. Let U be an underlying ground set of n objects (i.e., elements). In this paper, we deal with sets of objects. Assume that such a set is stored in some reasonable data-structure, where one can insert delete, or query an object in constant time. Querying for an object o P U, requires deciding if o P X. Such a representation of a set is straightforward to implement using an array to store the objects, and a hash table. This representation allows random access to the elements in the set, or uniform sampling from the set.

If hashing is not feasible, one can just use a standard dictionary data-structure – this would slow down the operations by a logarithmic factor.

Subset size estimation. We need the following standard estimation tool, [BHRLemma 2.8].

Lemma 2.3. Consider two sets be parameters, such that . Assume that one is given an access to a membership oracle that, given an element x P U, returns whether or not x P B. Then, one can compute an estimate s, such that , and computing this estimates requires oracle queries. The returned estimate is correct with probability

Weighted sampling. We need the following standard data-structure for weighted sampling.

Lemma 2.4. Given a set of objects , with associated weights can preprocess them in Optq time, such that one can sample an object out of H. The probability of an object to be sampled is . In addition the data-structure supports updates to the weights. An update or sample operation takes Oplog tq time.

Proof: Build a balanced binary tree T, where the objects of G are stored in the leaves. Every internal node u of T, also maintains the total weight wpuq of the objects in its subtree. The tree T has height Oplog tq, and weight updates can be carried out in Oplog tq time, by updating the path from the root to the leaf storing the relevant object.

Sampling is now done as follows – we start the traversal from the root. At each stage, when being at node u, the algorithm considers the two children . It continues to with probability , and otherwise it continues into . The object sampled is the one in the leaf that this traversal ends up at.

3. Data-structure: Sampling from the union of sets

The problem. Assume you are given a data-structure that contains a large collection F of sets of objects. The sets in F are not necessarily disjoint. The task is to preprocess the data-structure, such that given a sub-collection of the sets, one can quickly pick uniformly at random an object from the set

Naive solution. The naive solution is to take the sets under consideration (in G), compute their union, and sample directly from the union set . Our purpose is to do (much) better – in particular, the goal is to get a query time that depends logarithmically on the total size of all sets in G.

3.1. Preprocessing

For each set X P F, we build the set representation mentioned in the preliminaries section. In addition, we assume that each set is stored in a data-structure that enables easy random access or uniform sampling on this set (for example, store each set in its own array). Thus, for each set X, and an element, we can decide if the element is in X in constant time.

3.2. Uniform sampling via exact degree computation

The query is a family , and define (which should be distinguished from of an element , is the number of sets of G that contains it – that is, algorithm repeatedly does the following:

(I) Picks one set from G with probabilities proportional to their sizes. That is, a set X P G is picked with probability |X| {m. (II) It picks an element x P X uniformly at random. (III) Computes the degree (IV) Outputs x and stop with probability 1{d. Otherwise, continues to the next iteration.

. The above algorithm samples an element according to the uniform distribution. The algorithm takes in expectation time. The query time is takes with high probability.

. Observe that an element is picked by step (II) with probability . The element x is output with probability . As such, the probability of x to be output by the algorithm in this round is . This implies that the output distribution is uniform on all the elements of

The probability of success in a round is n{m, which implies that in expectation m{n rounds are used, and with high probability Oppm{nq log nq rounds. Computing the degree time, which implies the first bound on the running time. As for the second bound, observe that an element can appear only once in each set of G, which readily implies that

3.3. Almost uniform sampling via degree approximation

The bottleneck in the above algorithm is computing the degree of an element. We replace this by an approximation.

Definition 3.2. Given two positive real numbers x and y, and a parameter numbers of each other, denoted by

In the approximate version, given an item , we can approximate its degree and get an improved runtime for the algorithm.

Lemma 3.3. The input is a family of sets F that one can preprocess in linear time. Let be a sub-family and let be a parameter. One can sample an element with almost uniform probability distribution. Specifically, the probability of an element to be output is . After linear time preprocessing, the query time is , in expectation, and the query succeeds with high probability.

, it follows that we need to approximate the size of . Given a set X P G, we can in constant time check if x P X, and as such decide if . It follows that we can apply the algorithm of Lemma 2.3, which requires time, where the algorithm succeeds with high probability. The query algorithm is the same as before, except that it uses the estimated degree.

For be the event that the element x is picked for estimation in a round, and let be the event that it was actually output in that round. Clearly, we have 1{d, where d is the degree estimate of (with high probability), it follows that . Since there are dpxq copies of x in G, and the element for estimation is picked uniformly from the sets of G, it follows that the probability of any element to be output in a round is

as As such, the probability of the algorithm terminating in a round is As for the expected amount of work in each round, observe that it is proportional to

Intuitively, since the expected amount of work in each iteration is W, and the expected number of rounds is 1, the expected running time is This argument is not quite right, as the amount of work in each round effects the probability of the algorithm to terminate in the round (i.e., the two variables are not independent). We continue with a bit more care – let be the running time in the ith round of the algorithm if it was to do an ith iteration (i.e., think about a version of the algorithm that skips the experiment in the end of the iteration to decide whether it is going to stop), and let be a random variable that is 1 if the (original) algorithm had not stopped at the end of the first i iterations of the algorithm.

By the above, we have that . Importantly, are independent (while are dependent). We clearly have that the running time of the algorithm is (here, we define 1). Thus, the expected running time of the algorithm is proportional to

because of linearity of expectations, and since are independent.

Remark 3.4. The query time of Lemma 3.3 deteriorates to if one wants the bound to hold with high probability. This follows by restarting the query algorithm if the query time exceeds (say by a factor of two) the expected running time. A standard application of Markov’s inequality implies that this process would have to be restarted at most Oplog nq times, with high probability.

Remark 3.5. The sampling algorithm is independent of whether or not we fully know the underlying family F and the sub-family G. This means the past queries do not affect the sampled object reported for the query G. Therefore, the almost uniform distribution property holds in the presence of several queries and independently for each of them.

3.4. Almost uniform sampling via simulation

It turns out that one can avoid the degree approximation stage in the above algorithm, and achieve only a polylogarithmic dependence on . To this end, let x be the element picked. We need to simulate a process that accepts x with probability 1{dpxq.

We start with the following natural idea for estimating dpxq – probe the sets randomly (with replacement), and stop in the ith iteration if it is the first iteration where the probe found a set that contains x. If there are g sets, then the distribution of i is geometric, with probability . In particular, in expectation, , which implies that . As such, it is natural to take g{i as an estimation for the degree of x. Thus, to simulate a process that succeeds with probability 1{dpxq, it would be natural to return 1 with probability i{g and 0 otherwise. Surprisingly, while this seems like a heuristic, it does work, under the right interpretation, as testified by the following.

Lemma 3.6. Assume we have g urns, and exactly of them, are non-empty. Furthermore, assume that we can check if a specific urn is empty in constant time. Then, there is a randomized algorithm, that outputs a number , such that . The expected running time of the algorithm is Opg{dq.

Proof: The algorithm repeatedly probes urns (uniformly at random), until it finds a non-empty urn. Assume it found a non-empty urn in the ith probe. The algorithm outputs the value i{g and stops.

Setting be the output of the algorithm. we have that

using the formula The expected number of probes performed by the algorithm until it finds a non-empty urn is 1, which implies that the expected running time of the algorithm is Opg{dq.

The natural way to deploy Lemma 3.6, is to run its algorithm to get a number y, and then return 1 with probability y. The problem is that y can be strictly larger than 1, which is meaningless for probabilities. Instead, we backoff by using the value y{∆, for some parameter ∆. If the returned value is larger than 1, we just treat it at zero. If the zeroing never happened, the algorithm would return one with probability 1– which we can use to our purposes via, essentially, amplification. Instead, the probability of success is going to be slightly smaller, but fortunately, the loss can be made arbitrarily small by taking ∆ to be sufficiently large.

Lemma 3.7. There are g urns, and exactly of them are not empty. Furthermore, assume one can check if a specific urn is empty in constant time. Let parameter. Then one can output a number , such that The expected running time of the algorithm is Opg{dq.

Proof: We modify the algorithm of Lemma 3.6, so that it outputs instead of i{g. If the algorithm does not stop in the first 1 iterations, then the algorithm stops and outputs 0. Observe that the probability that the algorithm fails to stop in the first iterations, for

Let Z be the random variable that is the number output by the algorithm. Arguing as in Lemma 3.6, we have that More precisely, we have

Let . We have that more, for ∆, we have

As such, we have that

Lemma 3.8. The input is a family of sets F that one preprocesses in linear time. Let be a sub-family and let be a parameter. One can sample an element with almost uniform probability distribution. Specifically, the probability of an element to be output is . After linear time preprocessing, the query time is , in expectation, and the query succeeds, with high probability (in g).

Proof: The algorithm repeatedly samples an element x using steps (I) and (II) of the algorithm of Section 3.2. The algorithm returns x if the algorithm of Lemma 3.7, invoked with returns 1. We have that ∆ . The algorithm returns x in this iteration with probability . Observe that which implies that , it follows that , as desired. The expected running time of each round is Opg{dpxqq.

Arguing as in Lemma 3.3, this implies that each round, in expectation takes Opng{mq time, where . Similarly, the expected number of rounds, in expectation, is Again, arguing as in Lemma 3.3, implies that the expected running time is

Remark 3.9. Similar to Remark 3.4, the query time of Lemma 3.8 can be made to work with high probability with an additional logarithmic factor. Thus with high probability, the query time is

3.5. Handling outliers

Imagine a situation where we have a marked set of outliers O. We are interested in sampling from . We assume that the total degree of the outliers in the query is at most some prespecified parameter . More precisely, we have

Lemma 3.10. The input is a family of sets F that one can preprocess in linear time. A query is a sub-family , a set of outliers O, a parameter , and a parameter

One can either

(A) Sample an element -approximate uniform distribution. Specifically, the probabilities of two elements to be output is the same up to a factor of 1

Proof: The main modification of the algorithm of Lemma 3.8, is that whenever we encounter an outlier (the assumption is that one can check if an element is an outlier in constant time), then we delete it from the set X where it was discovered. If we implement sets as arrays, this can be done by moving an outlier object to the end of the active prefix of the array, and decreasing the count of the active array. We also need to decrease the (active) size of the set. If the algorithm encounters more than outliers then it stops and reports that the number of outliers is too large.

Otherwise, the algorithm continues as before. The only difference is that once the query process is done, the active count (i.e., size) of each set needs to be restored to its original size, as is the size of the set. This clearly can be done in time proportional to the query time.

4. In the search for a fair near neighbor

In this section, we employ our data structure of Section 3 to show the two results on uniformly reporting a neighbor of a query point mentioned in Section 1.1. First, let us briefly give some preliminaries on LSH. We refer the reader to [HIM12] for further details. Throughout the section, we assume that our metric space, admits the LSH data structure.

4.1. Background on LSH

Locality Sensitive Hashing (LSH). Let D denote the data structure constructed by LSH, and let c denote the approximation parameter of LSH. The data-structure D consists of L hash functions -approximate LSH), which are chosen via a random process and each function hashes the points to a set of buckets. For a point be the bucket that the point p is hashed to using the hash function following are standard guarantees provided by the LSH data structure [HIM12].

Lemma 4.1. For a given query point . Then for any point p P Npq, rq, we have that with a probability of least 1, we have (i) i.e., the number of outliers is at most 3L. Moreover, the expected number of outliers in any single bucket is at most 1.

Therefore, if we take different data structures with corresponding hash functions to denote the ith hash function in the jth data structure, we have the following lemma.

Lemma 4.2. Let the query point be q, and let p be any point in Npq, rq. Then, with high probability, there exists a data structure , such that 3L.

By the above, the space used by and the query time is rOpLq.

4.2. Approximate Neighborhood

For be data structures constructed by LSH. Let F be the set of all buckets in all data structures, i.e., For a query point q, consider the family G of all buckets containing the query, i.e., and thus . Moreover, we let O to be the set of outliers, i.e., the points that are farther than cr from q. Note that as mentioned in Lemma 4.1, the expected number of outliers in each bucket of LSH is at most 1. Therefore, by Lemma 3.10, we immediately get the following result.

Lemma 4.3. Given a set P of n points and a parameter r, we can preprocess it such that given query q, one can report a point p P S with probability where S is a point set such that . The algorithm uses space Spn, cq and its expected query time is

, we know that , and moreover in expectation . We apply the algorithm of Lemma 3.10. The runtime of the algorithm is in expectation algorithm produces an almost uniform distribution over the points in S.

distance, the runtime of our algorithm is and for the distance, the runtime of our algorithm is . These matches the runtime of the standard LSH-based near neighbor algorithms up to polylog factors.

4.3. Exact Neighborhood

As noted earlier, the result of the previous section only guarantees a query time which holds in expectation. Here, we provide an algorithm whose query time holds with high probability. Note that, here we cannot apply Lemma 3.10 directly, as the total number of outliers in our data structure might be large with non-negligible probability (and thus we cannot bound ). However, as noted in Lemma 4.2, with high probability, there exists a subset of these data structures such that for each j P J, the number of outliers in at most 3L, and moreover, we have that . Therefore, on a high level, we make a guess , which we initialize it to , and start by drawing samples from G; once we encounter more than 3L outliers from a certain data structure , we infer that j R J, update the value of , and set the weights of the buckets corresponding to equal to 0, so that they will never participate in the sampling process. As such, at any iteration of the algorithm we are effectively sampling from

data structures which we refer to as and we keep the hashed points by the ith hash function of the jth data structure in the array denoted by . Moreover, for each bucket in , we store its size

Query Processing. We maintain the variables showing the weights of the bucket which is initialized to that is stored in the preprocessing stage. Moreover, we keep

the set of outliers detected from which is initially set to be empty. While running the algorithm, as we detect an outlier in , we add it to , and we further decrease by one. Moreover, in order to keep track of , for any data structure exceeds 3L, we will ignore all buckets in , by setting all corresponding

At each iteration, the algorithm proceeds by sampling a bucket proportional toits weight , but only among the set of buckets from those data structures less than 3L outliers are detected so far, i.e., . We then sample a point uniformly at random from the points in the chosen bucket that have not been detected as an outlier, i.e., . If the sampled point is an outlier, we update our data structure accordingly. Otherwise, we proceed as in Lemma 3.8.

Definition 4.5 (Active data structures and active buckets). Consider an iteration k of the algorithm. Let us define the set of active data structures to be the data structures from whom we have seen less than 3L outliers so far, and let us denote their indices by Moreover, let us define the active buckets to be all buckets containing the query in these active data structures, i.e.,

Observation 4.6. Lemma 4.2 implies that with high probability at any iteration k of the algorithm

Definition 4.7 (active size). For an active bucket , we define its active size to be shows the total number of points in the bucket that have not yet been detected as an outlier, i.e.,

Lemma 4.8. Given a set P of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p P P with probability , so that there exists a value

The space used is rOpSpn, cqq and the query time is with high probability.

Proof: First note that the algorithm never outputs an outlier, and thus the third item is always satisfied. Next, let K be a random variable showing the number of iterations of the algorithm, and for an iteration k, define the random variable as the set of non-outlier points in the set of active buckets. Conditioned on know that the distribution of the output is almost uniform on . Moreover, we know that for all , and that by Observation 4.6, . Therefore, for all points in Npq, rq their probability of being reported as the final output of the algorithm is equal, and moreover, for all points in Npq, crqzNpq, rq, their probability of being reported is lower (as at some iteration, some of these points might go out of the set of active buckets). This proves the probability condition.

To bound the query time, let us consider the iterations where the sampled point p is an outlier, and not an outlier, separately. The total number of iterations where an outlier point is sampled is at most 3for which we only pay rOp1q cost. For non-outlier points, their total cost can be bounded using Lemma 3.8 and Remark 3.9 by

Lemma 4.9. Given a set P of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p P S with probability is an approximately uniform probability distribution: algorithm uses space Spn, cq and has query time of probability.

Proof: We run Algorithm of Lemma 4.8, and while its output is outside of Npq, rq, we ignore it and run the algorithm again. By Lemma 4.8, the output is guaranteed to be almost uniform on Npq, rq. Moreover, by Lemma 4.8, and because with high probability, we only need to run the algorithm times, we get the desired bound on the query time.

5. Experiments

In this section, we consider the task of retrieving a random point from the neighborhood of a given query point, and evaluate the effectiveness of our proposed algorithm empirically on real data sets.

Data set and Queries. We run our experiments on three datasets that are standard benchmarks in the context of Nearest Neighbor algorithms (see [ABF17])

(I) Our first data set contains a random subset of 10K points in the MNIST training data set [LBBH98]. The full data set contains 60K images of hand-written digits, where each image is of size 28 by 28. For the query, we use a random subset of 100 (out of 10K) images of the MNIST test data set. Therefore, each of our points lie in a 784 dimensional Euclidean space and each coordinate is in r0, 255s. (II) Second, we take SIFT10K image descriptors that contains 10K 128-dimensional points as data set and 100 points as queries

(III) Finally, we take a random subset of 10K words from the GloVe data set [PSM14] and a random subset of 100 words as our query. GloVe is a data set of 1.2M word embeddings in 100-dimensional space and we further normalize them to unit norm.

We use the Euclidean distance to measure the distance between the points.

LSH data structure and parameters. We use the locality sensitive hashing data structure for the Euclidean distance [AI08]. That is, each of the L hash functions concatenation of k unit hash functions . Each of the unit hash functions is chosen by selecting a point in a random direction (by choosing every coordinate from a Gaussian distribution with parameters p0, 1q). Then all the points are projected onto this one dimensional direction. Then we put a randomly shifted one dimensional grid of length w along this direction. The cells of this grid are considered as buckets of the unit hash function. For tuning the parameters of LSH, we follow the method described in [DIIM04], and the manual of E2LSH library [And05], as follows.

For MNIST, the average distance of a query to its nearest neighbor in the our data set is around 4.5. Thus we choose the near neighbor radius 5. Consequently, as we observe, the r-neighborhood of at least half of the queries are non-empty. As suggested in [DIIM04] to set the value of 4, we tune it between 3 and 5 and set its value to k and L so that the false negative rate (the near points that are not retrieved by LSH) is less than 10%, and moreover the cost of hashing (proportional to L) balances out the cost of scanning. We thus get 100. This also agrees with the fact that L should be roughly square root of the total number of points. Note that we use a single LSH data structure as opposed to taking instances. We use the same method for the other two data sets. For SIFT, we use 100, and for GloVe we use

Algorithms. Given a query point q, we retrieve all L buckets corresponding to the query. We then implement the following algorithms and compare their performance in returning a neighbor of the query point.

: Picks bucket uniformly at random and picks a random point in bucket.

: Picks bucket according to its size, and picks uniformly random point inside bucket.

: Picks bucket according to size, and then picks uniformly random point p inside bucket. Then it computes and rejects p with probability 1

: Picks bucket according to size, and picks uniformly random point p inside bucket. It approximates p’s degree and rejects p with probability 1 1

Degree approximation method. We use the algorithm of Section 3.4 for the degree approximation: we implement a variant of the sampling algorithm which repeatedly samples a bucket uniformly at random and checks whether p belongs to the bucket. If the first time this happens is at iteration i, then it outputs the estimate as

Experiment Setup. In order to compare the performance of different algorithms, for each query q, we compute Mpqq: the set of neighbors of q which fall to the same bucket as q by at least one of the L hash functions. Then for 100|Mpqq| times, we draw a sample from the neighborhood of the query, using all four algorithms. We compare the empirical distribution of the reported points on |Mpqq| with the uniform distribution on it. More specifically, we compute the total variation distance (statistical distance)to the uniform distribution. We repeat each experiment 10 times and report the average result of all 10 experiments over all 100 query points.

Results. Figure 5.1 shows the comparison between all four algorithms. To compare their performance, we compute the total variation distance of the empirical distribution of the algorithms to the uniform distribution. For the tuned parameters (results are as follows. For MNIST, we see that our proposed degree approximation based algorithm performs only 2.4 times worse than the optimal algorithm, while we see that other standard sampling methods perform 6.6 times and 10 times worse than the optimal algorithm. For SIFT, our algorithm performs only 1.4 times worse than the optimal while the other two perform 6.1 and 9.7 times worse. For GloVe, our algorithm performs only 2.7 times worse while the other two perform 6.5 and 13.1 times worse than the optimal algorithm.

Moreover, in order get a different range of degrees and show that our algorithm works well for those cases, we further vary the parameters k and L of LSH. More precisely, to get higher ranges of the degrees, first we decrease k (the number of unit hash functions used in each of the L hash function); this will result in more collisions. Second, we increase L (the total number of hash functions). These are two ways to increase the degree of points. For example for the MNIST data set, the above procedure increases the degree range from r1, 33s to r1, 99s.

Query time discussion. As stated in the experiment setup, in order to have a meaningful comparison between distributions, in our code, we retrieve a random neighbor of each query 100m times, where m is the size of its neighborhood (which itself can be as large as 1000). We further repeat each experiment 10 times. Thus, every query might be asked upto 10times. This is going to be costly for the optimal algorithm that computes the degree exactly. Thus, we use the fact that we are asking the same query many times and preprocess the exact degrees for the optimal solution. Therefore, it is not meaningful to compare runtimes directly. Thus we run the experiments on a smaller size dataset to compare the runtimes of all the four approaches: For 100, our sampling approach is twice faster than the optimal algorithm, and almost five times slower than the other two approaches. However, when the number of buckets (L) increases from 100 to 300, our algorithm is 4.3 times faster than the optimal algorithm, and almost 15 times slower than the other two approaches.

Trade-off of time and accuracy. We can show a trade-off between our proposed sampling approach and the optimal. For the MNIST data set with tuned parameters (100), by asking twice more queries (for degree approximation), the solution of our approach improves from 2.4 to 1.6, and with three times more, it improves to 1.2, and with four times more, it improves to 1.05. For the SIFT data set (using the same parameters), using twice more queries, the solution improves from 1.4 to 1.16, and with three times more, it improves to 1.04, and with four times more, it improves to 1.05. For GloVe, using twice more queries, the solution improves from 2.7 to 1.47, and with three times more, it improves to 1.14, and with four times more, it improves to 1.01.

Figure 5.1: Comparison of the performance of the four algorithms is measured by computing the statistical distance of their empirical distribution to the uniform distribution.

6. Acknowledgement

The authors would like to thank Piotr Indyk for the helpful discussions about the modeling and experimental sections of the paper.

References

[ABDAlekh Agarwal, Alina Beygelzimer, Miroslav Dud´ık, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. In Jennifer G. Dy and Andreas Krause, editors, Proc. 35th Int. Conf. Mach. Learning (ICML), volume 80 of Proc. of Mach. Learn. Research, pages 60–69. PMLR, 2018.

[ABF17] M. Aum¨uller, E. Bernhardsson, and A. Faithfull. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications, 2017.

[Ada07] Eytan Adar. User 4xxxxx9: Anonymizing query logs. Appeared in the workshop Query Log Analysis: Social and Technological Challenges, in association with WWW 2007, 01 2007.

[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.

[And05] Alexandr Andoni. E2lsh 0.1 user manual. https://www.mit.edu/ andoni/LSH/manual.pdf, 2005.

[AP19] Peyman Afshani and Jeff M. Phillips. Independent range sampling, revisited again. CoRR, abs/1903.08014, 2019. to appear in SoCG 2019.

[APS19] Martin Aum¨uller, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search: Independent range sampling in high dimensions. arXiv preprint arXiv:1906.01859, 2019.

[AW17] Peyman Afshani and Zhewei Wei. Independent range sampling, revisited. In Kirk Pruhs and Christian Sohler, editors, 25th Annual European Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria, volume 87 of LIPIcs, pages 3:1–3:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.

[BCN19] Suman K Bera, Deeparnab Chakrabarty, and Maryam Negahbani. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393, 2019.

[BHRPaul Beame, Sariel Har-Peled, Sivaramakrishnan Natarajan Ramamoorthy, Cyrus Rashtchian, and Makrand Sinha. Edge estimation with independent set oracles. CoRR, abs/1711.07567, 2017.

[BIOArturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. arXiv preprint arXiv:1902.03519, 2019.

[Cho17] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.

[CKLV17] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5029–5037, 2017.

[CKLV19] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvtiskii. Matroids, matchings, and fairness. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2212–2220. PMLR, 2019.

[DHPCynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.

[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), pages 253–262, 2004.

[DOBD18] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pages 2791–2801, 2018.

[EJJHadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, and Zachary Schutzman. Fair algorithms for learning in allocation problems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 170–179. ACM, 2019.

[ELL09] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009.

[HAAA14] Ahmad Basheer Hassanat, Mohammad Ali Abbadi, Ghada Awad Altarawneh, and Ahmad Ali Alhasanat. Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach. CoRR, abs/1409.0919, 2014.

[HGBJiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Sch¨olkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.

[HIM12] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Theory Comput., 8:321–350, 2012. Special issue in honor of Rajeev Motwani.

[HK01] David Harel and Yehuda Koren. On clustering using random walks. In Ramesh Hariharan, Madhavan Mukund, and V. Vinay, editors, FST TCS 2001: Foundations of Software Technology and Theoretical Computer Science, 21st Conference, Bangalore, India, December 13-15, 2001, Proceedings, volume 2245 of Lecture Notes in Computer Science, pages 18–41. Springer, 2001.

[HPS16] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Neural Info. Proc. Sys. (NIPS), pages 3315–3323, 2016.

[HQT14] Xiaocheng Hu, Miao Qiao, and Yufei Tao. Independent range sampling. In Richard Hull and Martin Grohe, editors, Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, , pages 246–255. ACM, 2014.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), pages 604–613, 1998.

[KE05] Matt J Keeling and Ken T.D Eames. Networks and epidemic models. Journal of The Royal Society Interface, 2(4):295–307, September 2005.

[KLK12] Yi-Hung Kung, Pei-Sheng Lin, and Cheng-Hsiung Kao. An optimal k-nearest neighbor for density estimation. Statistics & Probability Letters, 82(10):1786 – 1791, 2012.

[KLLJon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2017.

[KSAM19] Matth¨aus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. arXiv preprint arXiv:1901.08668, 2019.

[LBBH98] Yann LeCun, L´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[MSP16] Cecilia Munoz, Megan Smith, and DJ Patil. Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. Executive Office of the President and Penny Hill Press, 2016.

[OA18] Matt Olfat and Anil Aswani. Convex formulations for fair principal component analysis. arXiv preprint arXiv:1802.03765, 2018.

[PRWGeoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680–5689, 2017.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[QA08] Yinian Qi and Mikhail J. Atallah. Efficient privacy-preserving k-nearest neighbor search. In 28th IEEE International Conference on Distributed Computing Systems (ICDCS 2008), 17-20 June 2008, Beijing, China, pages 311–319. IEEE Computer Society, 2008.

[SDI06] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-neighbor methods in learning and vision: theory and practice (neural information processing). The MIT Press, 2006.

[TE11] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528, 2011.

[ZVGRG17] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Kr- ishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017.

Designed for Accessibility and to further Open Science