b

DiscoverSearch
About
My stuff
A Distributional Framework for Data Valuation
2020·arXiv
Abstract
Abstract

Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for statistical aspects of the data and does not give a way to reason about points outside the data set.

To address these limitations, we propose a novel framework – distributional Shapley – where the value of a point is defined in the context of an underlying data distribution. We prove that distributional Shapley has several desirable statistical properties; for example, the values are stable under perturbations to the data points themselves and to the underlying data distribution. We leverage these properties to develop a new algorithm for estimating values from data, which comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms for computing the (non-distributional) data Shapley values. We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.

As data becomes an essential driver of innovation and service, how to quantify the value of data is an increasingly important topic of inquiry with policy, economic, and machine learning (ML) implications. In the policy arena, recent proposals, such as the Dashboard Act in the U.S. Senate, stipulate that large companies quantify the value of data they collect. In the global economy, the business model of many companies involves buying and selling data. For ML engineering, it is often beneficial to know which type of training data is most valuable and, hence, most deserving of resources towards collection and annotation. As such, a principled framework for data valuation would be tremendously useful in all of these domains.

Recent works initiated a formal study of data valuation in ML [GZ19, JDW+19b]. In a typical setting, a data set  B = {zi}is used to train a ML model, which achieves certain performance, say classification accuracy 0.9. The data valuation problem is to assign credit amongst the training set, so that each point gets an “equitable” share for its contribution towards achieving the 0.9 accuracy. Most works have focused on leveraging Shapley value as the metric to quantify the contribution of individual  zi. The focus on Shapley value is in large part due to the fact that Shapley uniquely satisfies basic properties for equitable credit allocation [Sha53]. Empirical experiments also show that data Shapley is very effective – more so than leave-one-out scores – at identifying points whose addition or removal substantially impacts learning [GAZ17,GZ19].

At a high-level, prior works on data Shapley require three ingredients: (1) a fixed training data set of m points; (2) a learning algorithm; and (3) a performance metric that measures the overall value of a trained model. The goal of this work is to significantly reduce the dependency on the first ingredient. While convenient, formulating the value based on a fixed data set disregards crucial statistical considerations and, thus, poses significant practical limitations.

In standard settings, we imagine that data is sampled from a distribution D; measuring the Shapley value with respect to a fixed data set ignores this underlying distribution. It also means that the value of a data point computed within one data set may not make sense when the point is transferred to a new data set. If we actually want to buy and sell data, then it is important that the value of a given data point represents some intrinsic quality of the datum within the distribution. For example, a data seller might determine that z has high value based on their data set  Bs and sell zto a buyer at a high price. Even if the buyer’s data set  Bbis drawn from a similar distribution as Bs, the existing data Shapley framework provides no guarantee of consistency between the value of z computed within  Bsand within  Bb. This inconsistency may be especially pronounced in the case when the buyer has significantly less data than the seller.

Our contributions

Conceptual. Extending prior works on data Shapley, we formulate and develop a notion of distributional Shapley value in Section 2. We define the distributional variant in terms of the original data Shapley: the distributional Shapley value is taken to be the expected data Shapley value, where the data set is drawn i.i.d. from the underlying data distribution. Reformulating this notion of value as a statistical quantity allows us to prove that the notion is stable with respect to perturbations to the inputs as well as the underlying data distribution. Further, we show a mathematical identity that gives an equivalent definition of distributional Shapley as an expected marginal performance increase by adding the point, suggesting an unbiased estimator.

Algorithmic. In Section 3, we develop this estimator into a novel sampling-based algorithm, D-Shapley. In contrast to prior estimation heuristics, D-Shapley comes with strong formal approximation guarantees. Leveraging the stability properties of distributional Shapley value and the simple nature of our algorithm, we develop theoretically-principled optimizations to D-Shapley. In our experiments across diverse tasks, the optimizations lead to order-of-magnitude reductions in computational costs while maintaining the quality of estimations.

Empirical. Finally, in Section 4, we present a data pricing case study that demonstrates the consistency of values produced by D-Shapley. In particular, we show that a data broker can list distributional Shapley values as “prices,” which a collection of buyers all agree are fair (i.e. the data gives each buyer as much value as the seller claims). In all, our results demonstrate that the distributional Shapley framework represents a significant step towards the practical viability of the Shapley-based approaches to data valuation.

Related works

Shapley value, introduced in [Sha53], has been studied extensively in the literature on cooperative games and economics [SR+88], and has traditionally been used in the valuation of private information and data markets [KPR01,ADS19].

Our work is most directly related to recent works that apply Shapley value to the data valuation problem. [GZ19] developed the notion of “Data Shapley” and provided algorithms to efficiently estimate values. Specifically, leveraging the permutation-based characterization of Shapley value, they developed a “truncated Monte Carlo” sampling scheme (referred to as TMC-Shapley), demonstrating empirical effectiveness across various ML tasks. [JDW+19b] introduce several additional approximation methods for efficient computation of Shapley values for training data; subsequently, [JDW+19a] provided an algorithm for exact computation of Shapley values for the specific case of nearest neighbor classifiers.

Beyond data valuation, the Shapley framework has been used in a variety of ML applications, e.g. as a measure of feature importance [CDR07,K+10,DSZ16,LL17,CSWJ18]. The idea of a distributional Shapley value bears resemblance to the Aumann-Shapley value [AS74], a measure-theoretic variant of the Shapley that quantifies the value of individuals within a continuous “infinite game.” Our distributional Shapley value focuses on the tangible setting of finite data sets drawn from a (possibly continuous) distribution.

Preliminaries

Let D denote a data distribution supported on a universe Z. For supervised learning problems, we often think of  Z = X × Y where X ⊆ Rd and Yis the output, which can be discrete or continuous. For  m ∈ N, let S ∼ Dm a collection of k data points sampled i.i.d. from D. Throughout, we use the shorthand [m] = {1, . . . , m} and let k ∼ [m] denote a uniform random sample from [m].

We denote by  U : Z∗ →[0, 1] a potential function1or performance metric, where for any  S ⊆ Z, U(S) represents abstractly the value of the subset. While our analysis applies broadly, in our context, we think of U as capturing both the learning algorithm and the evaluation metric. For instance, in the context of training a logistic regression model, we might think of U(S) as returning the population accuracy of the empirical risk minimizer when S is the training set.

2.1 Distributional Shapley Value

Our starting point is the data Shapley value, proposed in [GZ19,JDW+19b] as a way to valuate training data equitably.

Definition 2.1 (Data Shapley Value). Given a potential function U and data set  B ⊆ Zwhere

|B| = m, the data Shapley value of a point  z ∈ Bis defined as

image

In words, the data Shapley value of a point  z ∈ Bis a weighted empirical average over subsets S ⊆ Bof the marginal potential contribution of z to each S; the weighting is such that each possible cardinality  |S| = k ∈ {0, . . . , m − 1}is weighted equally. The data Shapley value satisfies a number of desirable properties; indeed, it is the unique valuation function that satisfies the Shapley axioms2.Note that as the data set size grows, the absolute magnitude of individual data points’ values typically scales inversely.

While data Shapley value is a natural solution concept for data valuation, its formulation leads to several limitations. In particular, the values may be very sensitive to the exact choice of B; given another  B′ ̸= B where z ∈ B ∩ B′, the value  φ(z; U, B) might be quite different from  φ(z; U, B′). Atthe extreme, if a new point  z′ ̸∈ Bis added to B, then in principle, we would have to rerun the procedure to compute the data Shapley values for all points in  B ∪ {z′}.

In settings where our data are drawn from an underlying distribution D, a natural extension to the data Shapley approach would parameterize the valuation function by D, rather than the specific draw of the data set. Such a distributional Shapley value should be more stable, by removing the explicit dependence on the draw of the training data set.

Definition 2.2 (Distributional Shapley Value). Given a potential function  U : Z∗ →[0, 1], a distribution D supported on  Z, and some m ∈ N, the distributional Shapley value of a point  z ∈ Zis the expected data Shapley value over data sets of size m containing x.

image

In other words, we can think of the data Shapley value as a random variable that depends on the specific draw of data from D. Taking the distributional Shapley value  ν(z; U, D, m) to be the expectation of this random variable eliminates instability caused by the variance of  φ(z; U, B). Whiledistributional Shapley is simple to state based on the original Shapley value, to the best of our knowledge, the concept is novel to this work.

We note that, while more stable, the distributional Shapley value inherits many of the desirable properties of Shapley, including the Shapley axioms and an expected efficiency property; we cover these in Appendix B. Importantly, distributional Shapley also has a clean characterization as the expected gain in potential by adding  z ∈ Zto a random data set (of random size).

Theorem 2.3. Fixing U and D, for all z ∈ Z and m ∈ N,

image

That is, the distributional Shapley value of a point is its expected marginal contribution in U to a set of i.i.d. samples from D of uniform random cardinality.

Proof. The identity holds as a consequence of the definition of data Shapley value and linearity of expectation.

image

where (1) follows by the fact that  D ∼ Dm−1 consists of i.i.d. samples, so each  S ⊆ D with |S| = k−1is identically distributed according to  Dk−1.

Example: mean estimation. Leveraging this characterization, for well-structured problems, it is possible to give analytic expressions for the distributional Shapley values. For instance, consider estimating the mean  µof a distribution D supported on  Rd. For a finite subset  S ⊆ Rd, we take a potential U(S) based on the empirical estimator ˆµS.

image

Proposition 2.4. Suppose D has bounded second moments. Then for  z ∈ Zand  m ∈ N, ν(z; Uµ, D, m)for mean estimation over D is given by

image

for an explicit constant  Cm = Θ(1)determined by m.

Intuitively, this proposition (proved in Appendix B) highlights some desirable properties of distributional Shapley: the expected value for a random  z ∼ Dis an uniform share of the potential for a randomly drawn data set  S ∼ Dm; further, a point has above-average value when it is closer to  µthan expected. In general, analytically deriving the distributional Shapley value may not be possible. In Section 3, we show how the characterization of Theorem 2.3 leads to an efficient algorithm for estimating values.

2.2 Stability of distributional Shapley values

Before presenting our algorithm, we discuss stability properties of distributional Shapley, which are interesting in their own right, but also have algorithmic implications. We show that when the potential function U satisfies a natural stability property, the corresponding distributional Shapley value inherits stability under perturbations to the data points and the underlying data distribution. First, we recall a standard notion of deletion stability, often studied in the context of generalization of learning algorithms [BE02].

Definition 2.5 (Deletion Stability). For potential  U : Z∗ → [0, 1]and non-increasing  β : N → [0, 1],U is β(k)-deletion stable if for all  k ∈ N and S ∈ Zk−1, for all z ∈ Z

image

We can similarly discuss the idea of replacement stability, where we bound  |U(S ∪ {z}) − U(S ∪ {z′})|;note that by the triangle inequality,  β(k)-deletion stability of  U implies 2β(k)-replacement stability. To analyze the properties of distributional Shapley, a natural strengthening of replacement stability will be useful, which we call Lipschitz stability. Lipschitz stability is parameterized by a metric d, requires the degree of robustness under replacement of  z with z′ to scale according to the distance d(z, z′).

Definition 2.6 (Lipschitz Stability). Let (Z, d) be a metric space. For potential  U : Z∗ →[0, 1] and non-increasing  β : N →[0, 1], U is  β(k)-Lipschitz stable with respect to d if for all  k ∈ N, S ∈ Zk−1, and all z, z′ ∈ Z,

image

By taking d to be the trivial metric, where  d(z, z′) = 1 if z ̸= z′, we see that Lipschitz-stability generalizes the idea of replacement stability; still, there are natural learning algorithms that satisfy Lipschitz stability for nontrivial metrics. As one example, we show that Regularized empirical risk minimization over a Reproducing Kernel Hilbert Space (RKHS) – a prototypical example of a replacement stable learning algorithm – also satisfies this stronger notion of Lipschitz stability. We include a formal statement and proof in Appendix C.

Similar points receive similar values. As discussed, a key limitation with the data Shapley approach for fixed data set B is that we can only ascribe values to  z ∈ B. Intuitively, however, we would hope that if two points  z and z′ are similar according to some appropriate metric, then they would receive similar Shapley values. We confirm this intuition for distributional Shapley values when the potential function U satisfies Lipschitz stability.

Theorem 2.7. Fix a metric space (Z, d) and a distribution D over Z; let  U : Z∗ →[0, 1] be β(k)-Lipschitz stable with respect to d. Then for all  m ∈ N, for all z, z′ ∈ Z,

image

Proof. For any data set size  m ∈ N, we expand  ν(z′; U, D, m) to express it in terms of  ν(z; U, D, m).

image

where (2) follows by the assumption that  U is β(k)-Lipschitz stable and linearity of expectation.

Theorem 2.7 suggests that in many settings of interest, the distributional Shapley value will be Lipschitz in z. This Lipschitz property also suggests that, given the values of a (sufficiently-diverse) set of points Z, we may be able to infer the values of unseen points  z′ ̸∈ Zthrough interpolation. Concretely, in Section 3.2, we leverage this observation to give an order of magnitude speedup over our baseline estimation algorithm.

Similar distributions yield similar value functions. The distributional Shapley value is naturally parameterized by the underlying data distribution D. For two distributions  Dsand  Dt, given the value  ν(z; U, Ds, m), what can we say about the value  ν(z; U, Dt, m)? Intuitively, if  Dsand  Dtare similar under an appropriate metric, we’d expect that the values should not change too much. Indeed, we can formally quantify how the distributional Shapley value is stable under distributional shift under the Wasserstein distance.

For two distributions  Ds, Dt over Z, let Γstbe the collection of joint distributions over  Z ×Z, whosemarginals are  Ds and Dt.3 Fixing a metric d over Z, the Wasserstein distance is the infimum over all such couplings  γ ∈ Γstof the expected distance between (s, t) ∼ γ.

image

We formalize the idea that distributional Shapley values are stable under small perturbations to the underlying data distribution as follows.

Theorem 2.8. Fix a metric space (Z, d) and let  U : Z∗ →[0, 1] be  β(k)-Lipschitz stable with respect to  d. Suppose Ds and Dtare two distributions over Z. Then, for all  m ∈ N and all z ∈ Z,

image

Proof. For notational convenience, for any  z ∈ Zand subset  S ⊆ Z, we denote ∆zU(S) =U(S ∪ {z}) − U(S). Thus, fixing  z ∈ Z, we can write  ν(z; U, D, m) as  Ek∼[m] ES∼Dk−1 [∆zU(S)]. We analyze  ES∼Dk−1 [∆zU(S)] for each fixed  k ∈ {2, . . . , m}separately.4

Let  γ ∈ Γstbe some coupling of  Ds and Dt. Then, we can expand the expectation as follows.

image

where (4) and (6) follow by the assumption that the marginals of  γ are Ds and Dt; and (5) follows by linearity of expectation.

To bound the first term of (6), we expand the difference between ∆zU(S) and ∆zU(T) into a telescoping sum of k pairs of terms, where we bound each pair to depend on a single draw (si, ti) ∼ γ.For  S, T ∈ Zkand  i ∈ {0, . . . , k}, denote by  Zi =��kj=i+1 sj�∪��ij=1 tj�; note that  Z0 = Sand Zk = T. Then, we can rewrite ∆zU(S) − ∆zU(T) as follows.

image

Now suppose U is  β(k)-Lipschitz stable with respect to d; note that this implies ∆zUis 2β(k)-Lipschitz stable (because  βis non-increasing). Then, we obtain the following bound.

image

where (7) notes  Zi−1 and Zidiffer on only the ith data point; (8) follows from the assumption that ∆zUis 2β(k)-Lischitz stable and linearity of expectation; and finally (9) follows by the fact that each draw from  γ is i.i.d.

Finally, we note that the argument above worked for an arbitrary coupling in Γst; thus, we can express the difference in values in terms of the infimum over Γst.

image

where the first summation is taken over  k ∈ {2, . . . , m}as the term associated with k = 1 is 0.

Note that the theorem bounds the difference in values under shifts in distribution holding the potential U fixed. Often in applications, we will take the potential function to depend on the underlying data distribution. For instance, we may take to be a measure of population accuracy, e.g.  UDs = 1 − Ez∼D [ℓS(z)], where  ℓS(z) is the loss on a point  z ∈ Zachieved by a model trained on the data set  S ⊆ Z. In the case where we only have access to samples from  Ds, we still may want to guarantee that  ν(z; UDs, Ds, m) and  ν(z; UDt, Dt, m) are close. Thankfully, such a result follows by showing that  UDsis close to  UDt, and another application of the triangle inequality. For instance, when the potential is based on the population loss for a Lipschitz loss function, we can bound the difference in the potentials, again, in terms of the Wasserstein distance.

image

Here, we describe an estimation procedure, D-Shapley, for computing distributional Shapley values. To begin, we assume that we can actually sample from the underlying D. Then, in Section 3.2, we propose techniques to speed up the estimation and look into the practical issues of obtaining samples from the distribution. The result of these considerations is a practically-motivated variant of the estimation procedure, Fast-D-Shapley. In Section 3.3, we investigate how these optimizations perform empirically; we show that the strategies provide a way to smoothly trade-off the precision of the valuation for computational cost.

3.1 Obtaining unbiased estimates

The formulation from Theorem 2.3 suggests a natural algorithm for estimating the distributional Shapley values of a set of points. In particular, the distributional Shapley value  ν(z; U, D, m) is theexpectation of the marginal contribution of z to  S ⊆ Zon U, drawn from a specific distribution over data sets. Thus, the change in performance when we add a point z to a data set S drawn from the correct distribution will be an unbiased estimate of the distributional Shapley value. Consider the Algorithm 1, D-Shapley, which given a subset  Z0 ⊆ Zof data, maintains for each  z ∈ Z0a running average of  U(S ∪ {z}) − U(S) over randomly drawn S.

In each iteration, Algorithm 1 uses a fixed sample  Stto estimate the marginal contribution to U(St ∪ {z}) − U(St) for each  z ∈ Z. This reuse correlates the estimation errors between points in Z, but provides computational savings. Recall that each evaluation of U(S) requires training a ML model using the points in S; thus, using the same S for each  z ∈ Zreduces the number of models to be trained by |Z| per iteration. In cases where the  U(S ∪ {z}) can be derived efficiently from U(S), the savings may be even more dramatic; for instance, given a machine-learned model trained on S, it may be significantly cheaper to derive a model trained on  S ∪ {z}than retraining from scratch [GGVZ19].

The running time of Algorithm 1 can naively be upper bounded by the product of the number of iterations before termination T, the cardinality |Z| of the points to valuate, and the expected time

image

to evaluate U on data sets of size  k ∼ [m]. We analyze the iteration complexity necessary to achieve ε-approximations of  ν(z; U, D, m) for each z ∈ Z.

Theorem 3.1. Fixing a potential U and distribution D, and  Z ⊆ Z, suppose  T ≥ Ω�log(|Z|/δ)ε2 �.

Algorithm 1 produces unbiased estimates and with probability at least 1−δ, |ν(z; U, D, m) − νT (z)| ≤ ε.for all  z ∈ Z.

Remark. When understanding this (and future) formal approximation guarantees, it is important to note that we take  ε to be an absoluteadditive error. Recall, however, that  ν(z; U, D, m)is normalized by m; thus, as we take m larger, the relative error incurred by a fixed  εerror grows. In this sense, εshould typically scale inversely as O(1/m).

The claim follows by proving uniform convergence of the estimates for each  z ∈ Z. Importantly, while the samples in each iteration are correlated across  z, z′ ∈ Z, fixing z ∈ Z, the samples ∆zU(St)are independent across iterations. We include a formal analysis in Appendix D.

3.2 Speeding up D-Shapley: theoretical and practical considerations

Next, we propose two principled ways to speed up the baseline estimation algorithm. Under stability assumptions, the strategies maintain strong formal guarantees on the quality of the learned valuation. We also develop some guiding theory addressing practical issues that arise from the need to sample from D. Somewhat counterintuitively, we argue that given only a fixed finite data set  B ∼ DM, wecan still estimate values  ν(z; U, D, m) to high accuracy, for M that grows modestly with m.

Subsampling data and interpolation. Theorem 2.7 shows that for sufficiently stable potentials U, similar points have similar distributional Shapley values. This property of distributional Shapley values is not only useful for inferring the values of points  z ∈ Zthat were not in our original data set, but also suggests an approach for speeding up the computations of values for a fixed  Z ⊆ Z. Inparticular, to estimate the values for  z ∈ Z(with respect to a sufficiently Lipschitz-stable potential U) to O(ε)-precision, it suffices to estimate the values for an  ε-cover of Z, and interpolate (e.g. via nearest neighbor search). Standard arguments show that random sampling is an effective way to construct an  ε-cover [HP11].

As our first optimization, in Algorithm 2, we reduce the number of points to valuate through subsampling. Given a data set Z to valuate, we first choose a random subset  Zp ⊆ Z(where each z ∈ Zis subsampled into  Zpi.i.d. with some probability p); then, we run our estimation procedure on the points in  Zp; finally, we train a regression model on (z, νT (z)) pairs from  Zpto predict the values of the points from  Z \ Zp. By varying the choice of  p ∈ [0,1], we can trade-off running time for quality of estimation:  p ≈1 recovers the original D-Shapley scheme, whereas  p ≈0 will be very fast but likely produce noisy valuations.

Importance sampling for smaller data sets. To understand the running time of Algorithm 1 further, we denote the time to evaluate U on a set of cardinality  k ∈ Nby  R(k).5As such, we can express the asymptotic expected running time as  |Z| · T · Ek∼[m] [R(k)]. Note that when U(S) corresponds to the accuracy of a model trained on S, the complexity of evaluating U(S) may grow significantly with |S|. At the same time, as the data set size k grows, the marginal effect of adding  z ∈ Zto the training set tends to decrease; thus, we should need fewer large samples to accurately estimate the marginal effects. Taken together, intuitively, biasing the sampling of k ∈ [m] towards smaller training sets could result in a faster estimation procedure with similar approximation guarantees.

Concretely, rather than sampling  k ∼ [m] uniformly, we can importance sample each k proportional to some non-uniform weights  {wk : k ∈ [m]}, where the weights decrease for larger k. More formally, we weight the draw of k based on the stability of U. Algorithm 2 takes as input a set of importance weights  w = {wk}and samples k proportionally; without loss of generality, we assume �k wk = 1and let  k ∼ [m]wdenote a sample drawn such that  Pr[k] = wk. We show that for the right choice of weights w, sampling  k ∼ [m]wimproves the overall running time, while maintaining  ε-accurate unbiased estimates of the values  ν(z; U, D, m).

Theorem 3.2 (Informal). Suppose U is O(1/k)-deletion stable and can be evaluated on sets of cardinality k in time  R(k) ≥Ω(k). For  p ∈[0, 1] and  w = {wk ∝ 1/k}, Algorithm 2 produces estimates that with probability 1  − δ, are ε-accurate for all  z ∈ Zpand runs in expected time

image

To interpret this result, note that if the subsampling probability p is large enough that  Zpwill ε-cover Z, then using a nearest-neighbor predictor as R will produce  O(ε)-estimates for all  z ∈ Z. Further, if we imagine  ε = Θ(1/k), then the computational cost grows as the time it takes to train a model on m points scaled by a factor logarithmic in |Z| and the failure probability. In fact, Theorem 3.2 is a special case of a more general theorem that provides a recipe for devising an appropriate sampling scheme based on the stability of the potential U. In particular, the general

image

theorem (stated and proved in Appendix D) shows that the more stable the potential, the more we can bias sampling in favor of smaller sample sizes.

Estimating distributional Shapley from data. Estimating distributional Shapley values ν(z; U, D, m) requires samples from the distribution D. In practice, we often want evaluate the values with respect to a distribution D for which we only have some database  B ∼ DMfor some large (but finite)  M ∈ N. In such a setting, we need to be careful; indeed, avoiding artifacts from a single draw of data is the principle motivation for introducing the distributional Shapley framework. In fact, the analysis of Theorem 3.2 also reveals an upper bound on how big the database should be in order to obtain accurate estimates with respect to D. As a concrete bound, if U is O(1/k)-deletion stable and we take  ε = Θ(1/m) error, then the database need only be

image

In other words, for a sufficiently stable potential U, the data complexity grows modestly with m. Note that, again, this bound leverages the fact that in every iteration, we reuse the same sample St ∼ Dk for each z ∈ Z. See Appendix D for a more detailed analysis.

In practice, we find that sampling subsets of data from the database with replacement works well; we describe the full procedure in Algorithm 2, where we denote an i.i.d. sample of k points drawn uniformly from the database as  S ∼ Bk. Finally, we note that ideally, m should be close to the size of the training sets that model developers to use; in practice, these data set sizes may vary widely. One appealing aspect of both D-Shapley algorithms is that when we estimate values with respect to m, the samples we obtain also allow us to simultaneously estimate  ν(z; U, D, m′) for any m′ ≤ m.Indeed, we can simply truncate our estimates to only include samples corresponding to  Stwith |St| ≤ m′.

3.3 Empirical performance

We investigate the empirical effectiveness of the distributional Shapley framework by running experiments in three settings on large real-world data sets. The first setting uses the UK Biobank data set, containing the genotypic and phenotypic data of individuals in the UK [SGA+15]; we evaluate a task of predicting whether the patient will be diagnosed with breast cancer using 120 features. Overall, our data has 10K patients (5K diagnosed positively); we use 9K patients as our database (B), and take classification accuracy on a hold-out set of 500 patients as the performance metric (U). The second data set is Adult Income where the task is to predict whether income exceeds $50K/yr given 14 personal features [DG17]. With 50K individuals total, we use 40K as our database, and classification accuracy on 5K individuals as our performance metric. In these two experiments, we take the maximum data set size m = 1K and m = 5K, respectively.

For both settings, we first run D-Shapley without optimizations as a baseline. As a point of comparison, in these settings the computational cost of this baseline is on the same order as running the TMC-Shapley algorithm of [GZ19] that computes the data Shapley values  φ(z; U, B) for eachz in the data set B. Given this baseline, we evaluate the effectiveness of the proposed optimizations, using weighted sampling and interpolation (separately), for various levels of computational savings. In particular, we vary the sampling weights  {wk}and subsampling probability p to vary the computational cost (where weighting towards smaller k and taking p smaller each yield more computational savings). All algorithms are truncated when the average absolute change in value in the past 100 iterations is less than 1%.

To evaluate the quality of the distributional Shapley estimates, we perform a point removal experiment, as proposed by [GZ19], where given a training set, we iteratively remove points, retrain the model, and observe how the performance changes. In particular, we remove points from most to least valuable (according to our estimates), and compare to the baseline of removing random points. Intuitively, removing high value data points should result in a more significant drop in

image

Figure 1: Point removal performance. Given a data set and task, we iteratively a point, retrain the model, and evaluate its performance. Each curve corresponds to a different point removal order, based on the estimated distributional Shapley values (compared to random). For example, the 10% curve correspond to estimating values with 10% of the baseline computation of Algorithm 1. We plot classification accuracy vs. fraction of data points removed from the training set, for each task and each optimization method.

image

Figure 2: Smooth trade-off between computation and recovery. For each task, we plot the R2coefficient between the values computed using Algorithm 1 vs. the relative computational cost (as in Figure 1). The results show that there is a smooth trade-off between the recovery precision of the distributional Shapley values and the cost, across a wide range of learning algorithms.

the model’s performance. We report the results of this point removal experiment using the values determined using the baseline Algorithm 1, as well as various factor speed-ups (where t% refers to the computational cost compared to baseline).

As Figure 1 demonstrates, when training a logistic regression model, removing the high distributional Shapley valued points causes a sharp decrease in accuracy on both tasks, even when using the most aggressive weighted sampling and interpolation optimizations. Appendix E reports the results for various other models. As a finer point of investigation, we report the correlation between the estimated values without optimizations and with various levels of computational savings, for a handful of prediction models. Figure 2 plots the  R2 curves and shows that the optimizations provide a smooth interpolation between computational cost and recovery, across every model type. It is especially interesting that these trade-offs are consistently smooth across a variety of models using the 01-loss, which do not necessarily induce a potential U with formal guarantees of stability.

In our final setting, we push the limits of what types of data can be valuated. Specifically, by combining both weighted sampling and interpolation (resulting in a 500×speed-up), we estimate the values of 50K images from the CIFAR10 data set; valuating this data set would be prohibitively expensive using prior Shapley-based techniques. In particular, to obtain accurate estimates for each point, TMC-Shapley would require an unreasonably large number of Monte Carlo iterations due to the sheer size of the data base to valuate. We valuate points based on an image classification task, and demonstrate that the estimates identify highly valuable points, in Appendix E.

Next, we consider a natural setting where a data broker wishes to sell data to various buyers. Each buyer could already own some private data. In particular, suppose the broker plans to sell the set S and a buyer holds a private data set B; in this case, the relevant values are the data Shapley values  φ(z; U, B ∪ S) for each z ∈ S. Within the original data Shapley framework, computing these values requires a single party to hold both B and S. For a multitude of financial and legal concerns,

image

Figure 3: Consistent Pricing. Each buyer holds a data set B; the seller sells a data set S, where |B| = |S| = m. We compare the values estimated by the seller  ν(z; U, D, m) and φ(z; U, B ∪ S). (a)For various data sets and two data set sizes (m = 100 and m = 500): in blue, we plot the average rank correlation between  ν(z) and  φ(z) for  z ∈ S; in red, we plot the average absolute percentage error between the seller’s and buyer’s estimates. (b) Points from S are added to B in three different orders: according to  ν (D-Shapley), according to  φ(TMC), and randomly. The plot shows the change in the accuracy of the model, relative to its performance using the buyer’s initial dataset, as the points are added; shading indicates standard error of the mean.

neither party may be willing to send their data to the other before agreeing to the purchase. Such a scenario represents a fundamental limitation of the non-distributional Shapley framework that seemed to jeopardize its practical viability. We argue that the distributional Shapley framework largely resolves this particular issue: without exchanging data up front, the broker simply estimates the values  ν(z; U, D, m); in expectation, these values will accurately reflect the value to a buyer with a private data set B drawn from a distribution close to D.

We report the results of this case study on four large different data sets in Figure 3, whose details are included in Appendix F. For each data set, a set of buyers holds a small data set B (100 or 500 points), and the broker sells them a data set S of the same size; the buyers then valuate the points in S by running the TMC-Shapley algorithm of [GZ19] on  B ∪ S. In Figure 3(a), we show that the rank correlation between the broker’s distributional estimates  ν(z; U, D, m) and thebuyer’s observed values  φ(z; U, B ∪ S) is generally high. Even when the rank correlation is a bit lower (≈ 0.6), the broker and buyer agree on the value of the set as a whole. Specifically, we observe that the seller’s estimates are approximately unbiased, and the absolute percentage error is low, where

image

In Figure 3(b), we show the results of a point addition experiment for the Diabetes130 data set. Here, we consider the effect of adding the points of S to B under three different orderings: according to the broker’s estimates  ν(z; U, D, m), according to the buyer’s estimates  φ(z; U, B ∪ S), and under a random ordering. We observe that the performance (classification accuracy) increase by adding the points according to  ν(z) and according to  φ(z) track one another well; after the addition of all of S, the resulting models achieve essentially the same performance and considerably outperforming random. We report results for the other data sets in Appendix F.

The present work makes significant progress on understanding statistical aspects in determining the value of data. In particular, by reformulating the data Shapley value as a distributional quantity, we obtain a valuation function that does not depend on a fixed data set; reducing the dependence on the specific draw of data eliminates inconsistencies in valuation that can arise to sampling artifacts. Further, we demonstrate that the distributional Shapley framework provides an avenue to valuate data across a wide variety of tasks, providing stronger theoretical guarantees and orders of magnitude speed-ups over prior estimation schemes. In particular, the stability results that we prove for distributional Shapley (Theorems 2.7 and 2.8) are not generally true for the original data Shapley due to its dependence on a fixed dataset.

One outstanding limitation of the present work is the reliance on a known task, algorithm, and performance metric (i.e. taking the potential U to be fixed). We propose reducing the dependence on these assumptions as a direction for future investigations; indeed, very recent work has started to chip away at the assumption that the learning algorithm is fixed in advance [YGZ19].

The distributional Shapley perspective also raises the thought-provoking research question of whether we can valuate data while protecting the privacy of individuals who contribute their data. One severe limitation of the data Shapley framework, is that the value of every point depends nontrivially on every other point in the data set. In a sense, this makes the data Shapley value an inherently non-private value: the estimate of  φ(z; U, B) for a point  z ∈ Breveals information about the other points in B. By marginalizing the dependence on the data set, the distributional Shapley framework opens the door for to estimating data valuations while satisfying strong notions of privacy, such as differential privacy [DMNS06]. Such an estimation scheme could serve as a powerful tool amidst increasing calls to ensure the privacy of and compensate individuals for their personal data [LNG19].

[ADS19] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726, 2019.

[Aga11] Shivani Agarwal. Algorithmic stability. Lecture notes on Statistical Learning Theory, 2011.

[AS74] Robert J Aumann and Lloyd S Shapley. Values of non-atomic games. Princeton University Press, 1974.

[BE02] Olivier Bousquet and Andr´e Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.

[CDR07] Shay Cohen, Gideon Dror, and Eytan Ruppin. Feature selection via coalitional game theory. Neural Computation, 19(7):1939–1961, 2007.

[CSWJ18] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. L-shapley and c-shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610, 2018.

[DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284, 2006.

[DSZ16] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.

[GAZ17] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.

[GGVZ19] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pages 3513–3526, 2019.

[GZ19] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242–2251, 2019.

[HP11] Sariel Har-Peled. Geometric approximation algorithms. Number 173. American Mathematical Soc., 2011.

[JDW+19a]Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019.

[JDW+19b]Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve G¨urel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2019.

[K+10]Igor Kononenko et al. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(Jan):1–18, 2010.

[KPR01] Jon Kleinberg, Christos H Papadimitriou, and Prabhakar Raghavan. On the value of private information. In Theoretical Aspects Of Rationality And Knowledge: Proceedings of the 8 th conference on Theoretical aspects of rationality and knowledge, volume 8, pages 249–257. Citeseer, 2001.

[LL17] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[LNG19] Katrina Ligett, Kobbi Nissim, and Ayelet Gordon-Tapiero. Data co-ops. https://csrcl.huji.ac.il/book/data-co-ops, 2019.

[RDS+15]Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[SDG+14]Beata Strack, Jonathan P DeShazo, Chris Gennings, Juan L Olmo, Sebastian Ventura, Krzysztof J Cios, and John N Clore. Impact of hba1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed research international, 2014, 2014.

[SGA+15]Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.

[Sha53] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

[SR+88]Lloyd S Shapley, Alvin E Roth, et al. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press, 1988.

[SVI+16]Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[YGZ19] Gal Yona, Amirata Ghorbani, and James Zou. Who’s responsible? jointly quantifying the contribution of the learning algorithm and training data. arXiv preprint arXiv:1910.04214, 2019.

Here, we provide a high-level review of the axioms that Shapley used to describe an equitable valuation function [Sha53]. We consider the data Shapley setting, letting  φ(z; U, B) denote the value of  z ∈ Bfor a finite subset  B ⊆ Zwith respect to potential  U : Z∗ → [0, 1].

image

That is, if a data point contributes no marginal gain in potential to any nontrivial subset, then it receives no value.

image

That is, the value of a data point with respect to the combination of two tasks (addition of two potentials) is the sum of the values with respect to each task (potential) separately.

Theorem A.1 ( [Sha53] ). The Shapley value is the unique valuation function that satisfies the symmetry, null player, and additivity axioms.

Additionally, the Shapley value satisfies the desirable property that it allocates all of the value to the contributors.

image

It is straightforward to verify that the distributional Shapley value immediately inherits the properties of symmetry, null player, and additivity (by linearity of expectation). Further, it satisfies an on-average variant of efficiency.

Proposition A.2. Given a potential U and a data distribution  D, for m ∈ N,

image

Proof. We expand the expected distributional Shapley value with its definition and then apply linearity of expectation.

image

Proposition (Restatement of Proposition 2.4). Suppose D has bounded second moments. Then for z ∈ Z and m ∈ N, ν(z; Uµ, D, m)for mean estimation over D is given by

image

for an explicit constant  Cm = Θ(1)determined by m.

Proof. Consider the unsupervised learning task of mean estimation using the empirical estimator. Specifically, suppose we receive samples from some distribution D supported on  Rdwith mean µ = Es∼D[s] and bounded second moments. Given a subset  S ⊆ Rd, we consider the empirical estimator ˆµS = 1|S| · �s∈S s. We define a potential U(S) by the performance of the empirical estimator. For notational convenience, let  Es∼D�∥s − µ∥2�= R2 for some R = Θ(1).

image

By convention, we will assume that  U(∅) = 0. As such, we can evaluate the difference in potentialsas follows.

image

Importantly, note that we can relate ˆµS∪{z} to ˆµS.

image

Using these expressions, we can expand the distributional Shapley value into a form that will be convenient to work with.

image

We, thus, focus our efforts on bounding the summation from k = 2 to m. As such, we can evaluate the difference in potentials within the expectation as follows.

image

Taking an expectation over  S ∼ Dk−1, we can simplify each term in the summation separately; first, some identities that will be useful and hold for all  n ∈ N:

image

where (10) follows because ˆµSan unbiased estimator of  µ; (11) holds for all  q ∈ Rd; and (12) is a well-known fact that can be derived using (10) and (11).

Beginning with the first inner product.

image

where (13) applies (11) with  q = µ − zand (14) applies (12).

Expanding the next term.

image

where (15) follows by applying linearity of expectation and (10) to the first term and (12) to the second term.

Thus, in all, the value can be expressed as follows.

image

where �mk=2 1k·(k−1) = m−1mand we take  c(m) = �mk=2 1k2·(k−1). Thus, plugging this expression back into our original expansion.

image

where  C(m) = 2 − 1/m + c(m) = Θ(1) is an explicit function of m, and we use the fact that ES∼Dm�∥ˆµS − µ∥2�= 1m · R2.

Suppose  Z = X × Y; let Fto denote a Reproducing Kernel Hilbert Space (RKHS), with associated feature map  ϕ : X → F, inner product  ⟨·, ·⟩F, and norm ∥·∥F, such that for all  f ∈ F,

image

Given F, we define a natural metric over Z, where for labeled pairs  zi = (xi, yi) and zj = (xj, yj),

image

That is, the distance is given by the RKHS norm if  xi and xjhave the same label, and are arbitrarily dissimilar otherwise.

We define the potential of a subset  S ⊆ Zfor an RKHS learning problem as the performance achieved when training using S. Specifically, suppose  ℓ: [0, 1]  ×[0, 1]  → R+is an L-Lipschitz, convex loss function and D is a distribution supported on Z. We define the potential function UF : Z∗ → [0,1] in terms of the population loss over D of the following regularized ERM.

image

where erS(f) = 1|S|�(x,y)∈S ℓ(f(x), y) and λ > 0.

Lemma C.1 (Learning RKHS is Lipschitz stable). Suppose F is a RKHS with feature map φ : X → F. Let D be a distribution over  Z = X × Ysuch that  E(x,y)∼D [∥φ(x)∥F] = R. Then, UF : Z∗ → [0, 1] is (2L2R/λk)-Lipschitz stable with respect to  dF.

In other words, as with the standard notion of replacement stability, the Lipschitz stability depends on the Lipschitz constant of the loss, the expected norm over D, and (inversely on) the degree of regularization.

Proof. We follow the proof of replacement stability of RKHS from [Aga11] closely. For notational convenience, we drop reference to  F in ∥·∥ and ⟨·, ·⟩.

Suppose  S ∈ Zk−1and suppose  z, z′ ∈ Zare two points such that z = (x, y) and  z′ = (x′, y); i.e. they share the same label. By the definition of  dF, this is the only case we need to consider. Let f, f′ ∈ Fdenote the empirical risk minimizers over  S ∪ {z} and S ∪ {z′}, respectively.

image

For  α ∈ [0, 1], let

image

By the assumption that  ℓis convex, we can derive the following inequalities for any (x, y) ∈ Z andany  S ⊆ Z.

image

Note that  fαand  f′αare also feasible hypotheses in the Hilbert space. Thus, by the fact that  f, f′are ERMs,

image

Rearranging, and applying convexity, we derive the following inequality.

image

We can simplify the inner term of the LHS of (17) as follows.

image

Then, we can bound the RHS of (17) as follows.

image

where (18) follows by expanding according to (16), and then simplifying; (19) follows by the fact that erS∪{z}(f) = 1k · ℓ(f(x), y) +  k−1k erS(f), but the errors on S cancel to 0, erS(f′) − erS(f) + erS(f) − erS(f′); (20) follows by the Lipschitzness of  ℓ; and (21) follows by Cauchy-Schwarz.

The analysis above applied for any  α ∈ [0,1]; taking  α = 1/2 implies

image

Because  ℓ is L-Lipschitz, we obtain Lipschitz-stability of  UF.

image

We require the following standard concentration inequality.

Theorem (Hoeffding’s Inequality). Suppose  X1, . . . , XTare independent random variables, where Xtis bounded in the range [−bt, bt]. Let X = 1T�Tt=1 Xt. Then,

image

D.1 Iteration complexity of Algorithm 1.

Theorem (Restatement of Theorem 3.1). Fixing a potential U and distribution D, and  Z ⊆ Z, suppose  T ≥ Ω�log(|Z|/δ)ε2 �. Algorithm 1 produces unbiased estimates and with probability at least 1  − δ, |ν(z; U, D, m) − νT (z)| ≤ ε. for all z ∈ Z.

Proof. The theorem follows from the analysis of Theorem D.1, by taking the stability to be trivial, β(k) = 1, and using uniform sampling wk = 1/m for all k ∈ [m].

D.2 Running time analysis under stability.

Theorem D.1 (Generalizes Theorem 3.2). Suppose U is β(k)-deletion stable and for all sets  S ⊆ Zof cardinality |S| = k, U(S) can be evaluated in time R(k); consider a set of positive weights {wk : k ∈ [m]}and  p ∈[0, 1]. Algorithm 2 produces unbiased estimates of  ν(z; U, D, m) that with probability at least 1  − δ are ε-accurate for all  z ∈ Zpand runs in expected time

image

Proof. First, we bound the iteration complexity and time complexity to evaluate models at each iteration within Algorithm 2. The running time bound then follows by the fact that we need to evaluate a model per  z ∈ Zp, per iteration where the expected cardinality of  |Zp| = p · |Z|.

Abusing notation, denote by

image

Suppose we sample k according to a possibly non-uniform discrete distribution where  Pr[k ∈ m] = wk;we denote a random drawn from this distribution by  k ∼ [m]w. Then, by sampling  k ∼ [m]w, computing ∆zU(S) for S ∼ Dk, and reweighting, we obtain an unbiased estimate of  ν(z; U, D, m).

image

For simplicity, we analyze a sampling scheme where we sample  Tksets with cardinality k for Tk ≥ wk · T. (By the multiplicative Chernoff bound, this event will occur with high probability.) That is, for all  z ∈ Zand for each  k ∈ [m], we sample  Tksubsets  S ∼ Dkand compute ∆zU(S). For each  z ∈ Z, each such sample is an independent unbiased estimate of ∆zU(k), so reweighting according to  wkand averaging over  k ∈ [m] gives an unbiased estimate of  ν(z; U, D, m).

image

Note that by  β(k)-deletion stability, for each k, the terms in the summation associated with ∆zU(k)are bounded in magnitude by β(k)wkm. Thus, we can apply Hoeffding’s inequality to derive the following bound to obtain  ε-error with probability at least 1  − δ0.

image

Thus, taking the failure probability  δ0 = δ/ |Z|small enough to union bound over all  z ∈ Z, we derive the following bound on T.

image

Using this bound on T, we can compute the necessary running time for Algorithm 2 in terms of R(k) per z ∈ Zp.

image

Thus, we can compare various sampling schemes for different stability factors. Note that in the case of the uniform sampling scheme, where  wk = 1/m for all k ∈ [m], the sampling probabilities cancel.

image

Thus, the overall running time is given as  RTw(m) = p·|Z|· log(|Z|/δ)ε2m2 ·��mk=1β(k)2wk

image

Concretely, to see the special case of the theorem stated as Theorem 3.2, suppose that  β(k) = k−bfor  b ≥ 1/2 and R(k) = kc for c ≥1. With these settings of the parameters, uniform sampling takes time

image

Suppose, instead, we take  wk ∝ k1−2b; that is, we choose  wksuch that the first summation will still be bounded by  Hm = Θ(log(m)). Under such a sampling scheme, the running time is bounded as

image

In other words, the biased sampling scheme allows us to save roughly a factor-m2b−1 in computation time. Thus, if U is O(1/k)-deletion stable, then the biased sampling scheme saves roughly a factor m in computation time.

D.3 Finite Sample Approximation to D

While the analysis of Theorem D.1 focuses on the running time of Algorithm 2, we can equally interpret it as a sample complexity bound. In particular, taking R(k) = k corresponds to the sample complexity of taking a fresh sample  S ∼ Dkper iteration. Thus, using Algorithm 2, the naive sample complexity given by resampling for each iteration gives the bound of

image

Taking  β(k) = wk = 1/k yields

image

We report the results of additional empirical evaluations of Algoirthm 2, as described in Section 3.3. We depict the results of point removal experiments by plotting model performance vs. fraction of training data removed, where points are removed in order of their Shapley estimates or randomly. We apply the interpolation and weighted sampling speed-ups from Algorithm 2, independently. Both strategies obtain an order-of-magnitude speed-up with no qualitative performance change.

image

Supplementary Figure 1: Point removal curves for D-Shapley estimates under various speed-ups, using weighted sampling and interpolation across two ML tasks and three learning algorithms.

E.1 Speeding-up Distributional Shapley for Cifar10

In this experiment, we apply both speed-up methods to compute the distributional Shapley values for CIFAR10 dataset. We apply weighted sampling corresponding to a speed-up factor of 10 and subsampling with interpolation corresponding to a speed-up factor of 50, to compute values for 1000 data points. We valuate points based on their effect on an image classification task. We use an Inception-v3 [SVI+16] model, pretrained on the ILSVRC2012 (Imagenet) [RDS+15] dataset; then, we base our potential function U(S) on the performance of the model resulting after retraining the final layer of the network (holding all other layers fixed) using the retraining set S. Figure 2 shows the point-removal results for the complete dataset (e.g. removing 50% of the points with the highest D-Shapley value causes the prediction accuracy to drop from 77% to 68%.)

image

Supplementary Figure 2: Point removal experiment for CIFAR10, using distributional Shapley estimates computed by Fast-D-Shpaley.

We report the complete results for the caze study from Section 4. We use four large-scale datasets from the UCI repository [DG17]:

(1) Covertype dataset with 581, 012 samples where each sample contains 54 visual features of forest images and the task is to detect the type of forest cover (from a set of 7 different covers). We use a Random Forest model.

(2) Diabetes130 dataset [SDG+14] that with 100, 000 samples. Each sample contains 54 patient and hospital features. The task is to predict whether the patient will be readmitted to the hospital. We use an AdaBoost model.

(3) Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) Data Set which has 165, 632 points where each point has 18 attributes and has one of the 5 postures. We use a multinomial logistic regression model.

(4) Dataset for Sensorless Drive Diagnosis Data Set that contains 58, 509 data points. Each data point has 48 features. The dataset has 11 classes. We use a Gradient Boosting model.

image

Supplementary Figure 3: Points from an acquired set are added to the buyer’s initial dataset in three different orders: according to  ν (D-Shapley), according to  φ(TMC), and randomly. The plot shows the change in the accuracy of the model, relative to its performance using the buyer’s initial dataset, as the points are added; shading indicates standard error of the mean.


Designed for Accessibility and to further Open Science