In database terminology, this paper is about efficient incremental view updates, specifically about using provenance annotations to propagate the effect of deletions from the input data to the output. However, the views that we consider are regression models (linear and binomial/multinomial logistic regression) and the input data consists of the samples used to train these models.
The need for incremental techniques to efficiently update regression models arises in several contexts, for example data cleaning and interpretability. Data cleaning has been extensively studied by the database community [8, 12, 17, 46], and is typically an iterative and interactive process, allowing data analysts to alternate between analysis and cleaning tasks, as well as to interact with other parties such as IT staff and data curators [33]. Machine learning techniques are particularly sensitive to dirty data in training datasets, since it can result in erroneous models and counter-intuitive predictions for test datasets [8]. A number of techniques have therefore recently been proposed for detecting and repairing dirty data in machine learning, e.g., [25, 32]. The work presented in this paper can be incorporated into these data cleaning pipelines by assuming that dirty data in the training set has already been detected, and addresses the next step by providing a solution for incrementally updating the machine learning model after the dirty data is removed.
ing (see, for example, the general discussions in [15, 38], the extensive human subjects experiments in [45], as well the many references in these papers). The problem is being studied from several different perspectives (see Sec. 2). The data-driven approaches of [15, 35] discover factors of interpretability by performing repeated retraining of models using multiple different subsets of a training dataset to understand the relationship between samples with certain feature characteristics and the model behavior. Such repeated retraining also occurs in model debugging [25, 28, 34] and deletion diagnostics [9].
In this respect, our work shares goals with [30], which develops an influence function to approximately quantify the influence of a single training sample on the model parameters and prediction results; this can also be used for estimating the model parameter change after the removal of one training sample. However, extending the influence function approach to multiple training samples significantly weakens prediction accuracy. In contrast, our techniques are not only efficient but significantly more accurate.
Connection to Provenance. Note that the problem of incrementally updating the model after removing a subset of the training samples can be seen as a question of data provenance [4, 7, 20]. Data provenance tracks the dependencies between input and output data; in particular, the provenance semiring framework [20, 21] has been used for applying incremental updates (specifically deletions) to views.
In the semiring framework, input data is annotated with provenance tokens which are carried through the operators performed on the data (e.g. select, project, join, union). Output data is then annotated with provenance polynomials expressed in terms of the provenance tokens. When an input tuple is deleted, the effect on the output can be efficiently calculated by essentially “zeroing out” its token in the provenance polynomial. Recently, the framework has been extended to include basic linear algebra operations: matrix addition and multiplication [52]. In this extension, the provenance polynomials play the role of scalars and multiplication with scalars plays the role of annotating matrices and vectors with provenance.
As an example, suppose that p,q,r,s are provenance tokens that annotate samples in a training dataset. Our methods will show that vectors of interest (such as the vector of model parameters) can be expressed with provenance-annotated expressions such as:
Here, u, v, z are numerical vectors signifying contributions to the answer w and they are annotated (algebraic operation ∗) with which are provenance polynomials to be read as follows: the provenance
represents the use of both data items labeled p and q and, in fact, the first item is used twice. Now suppose the data item annotated with r is deleted while those annotated p,q,s are retained. We can express the updated value of w under this deletion by setting r to the “provenance 0 polynomial”, denoted 0prov which signifies absence, and p,q,s to the “provenance 1 polynomial”, denoted 1prov, which signifies “neutral” presence, no need to track further. The algebraic properties of provenance polynomials and of their annotation of matrices/vectors ensure what one would expect, e.g, 0prov
as well as 0prov
(the all-zero vector) and 1prov
. It follows that under this deletion w = u + z.
Approach. In this paper, we use the extension of the semiring framework to matrix operations to track the provenance of input samples through the training of logistic regression and linear regression models using gradient descent and its variants. In each iteration of the training phase, a gradient-based “update rule” updates the model parameters, which can be annotated with provenance polynomials. For logistic regression, we can achieve this via piecewise linear interpolation over the non-linear components in the gradient update rule.
In addition to enabling provenance tracking, the linearization of the gradient update rule allows us to separate the contributions of the training samples from the contributions of the model parameters from the previous iteration. As a result, the effect of deleting training samples on the gradient update rule can be obtained by “zeroing out” the provenance tokens corresponding to those samples.
Challenges. Reasoning over provenance to enable incremental updates introduces significant overhead in the gradient descent calculation. To speed up incremental updates over model parameters for dense datasets, we use several optimizations in our implementation, PrIU: First, between iterations during the training phase over the full training dataset, we cache intermediate results (some matrix expression) that capture only the contribution of the training samples. These are annotated with provenance. Then during the model update phase, the propagation of the deletion of a subset of samples comes down to a subtraction of the "zeroedout" contributions of the removed samples. Second, we apply singular value decomposition (SVD) over the intermediate results to reduce their dimensions. An optimized version of PrIU, PrIU-opt, is also designed for further optimizations over datasets with small feature sets using incremental updates to eigenvalues. (For logistic regression, it is used by terminating provenance tracking early when provenance expressions stabilize. See Section 5 for more details). But the optimizations above cannot work for sparse datasets, for which we use only the linearization of the update rule for logistic regression.
As we shall see, PrIU and PrIU-opt can lead to speed-ups of up to 2 orders of magnitude when compared to a baseline of retraining the model from the updated input data; however, for sparse datasets the speedup is only 10%. While the practical impact of this speed-up may be small for an engineer who only deletes one subset of training samples, especially if retraining takes only a few minutes, the impact is much greater for an engineer who repeatedly removes multiple different subsets of training samples, e.g. when exploring factors of interpretability. In this case, even one order of magnitude speed-up reduces exploration from several hours to a few minutes.
Contributions of this paper include:
(1) A theoretical framework which enables data provenance to be tracked and used for fast incremental model updates when subsets of training samples are removed. The framework extends the approach in [19, 20, 52] to linear regression and (binary and multinomial) logistic regression models.
(2) Analytical results showing the convergence and accuracy of the updated model parameters for logistic regression, which are approximately computed by applying piecewise linear interpolation over the non-linear operations in the model parameter update rules.
(3) Efficient provenance-based algorithms, PrIU and PrIUopt, which achieve fast model updates after removing subsets of training samples.
(4) Extensive experiments showing the effectiveness and accuracy of PrIU and PrIU-opt in incrementally updating the linear regression and logistic regression models compared to the straightforward approach of retraining from scratch, as well as compared to implementing an extension of the influence function in [30].
(5) Enabling work on interpretability that seeks to understand the effect of removing subsets of the training data, rather than just of a single training sample.
The remainder of the paper is organized as follows. In Section 2, we describe related work in incremental model maintenance, data provenance, data cleaning, and machine learning model interpretability. Section 3 reviews the basic concepts of linear regression and logistic regression. The theoretical development of how to use provenance in the update rules of linear regression and logistic regression is presented in Section 4, and its implementation provided in Section 5. Experimental results comparing our approach to other solutions are presented in Section 6. We conclude in Section 7.
To our knowledge, this is the first work to use provenance for the purpose of incrementally updating machine learning model parameters.
Incremental model maintenance. There have been several proposals for materializing machine learning models for future reuse. [13, 22, 40] target the problem of efficiently updating the model as the training data changes, which focus primarily on linear regression and Naive Bayes models, and use closed-form solutions (rather than iterative algorithms, e.g., gradient-based approaches) of the model parameters to determine incremental updates in light of additions and deletions of training samples while [23] deals with how to merge prematerialized models to construct new models based on user requests. In addition, [22] also deals with incremental updates of the model parameters based on the Mixture Weight Methods (a variant of gradient descent) for logistic regression. The method, however, puts additional training samples into another batch and averages the pre-computed parameters derived from other batches (over the original data) with the parameters computed over the additional batch. This cannot be used for incremental deletions which is our focus in this paper.
The basic ideas of [13, 22, 40] on how to incrementally update linear regression models are somewhat similar. Due to the existence of the matrix inverse operations in the closed-form solution for linear regression, only the intermediate results built with linear operations are maintained as views. They are updated when insertion or deletion happens in the input training data. After that, matrix inversion is used to compute the final updated model parameters. In contrast, our approach proceeds directly to a gradient descent-based linear regression. As we shall see, our experiments show that our approach is more efficient than the closed-form update.
Data provenance. Data provenance captures where data comes from and how it is processed. Within the database community, various approaches have been proposed to track provenance through queries, e.g. where and why provenance [4], and semiring provenance [2, 20]. Provenance is used to identify the source of errors in computational processes, such as workflows [1] and network diagnostics [53]. It is also used to support efficient incremental updates through database queries and schema mappings [18, 19, 26] and workflow computation [16]. Provenance support for linear algebra operations in the context of machine learning tasks has also been recently studied [52]. This work was mentioned in [5] as a first step in using data provenance for interpretability of machine learning models.
Data cleaning. The goal of data cleaning is to detect and fix errors in data, and is a crucial step in preparing data for data analytics/machine learning tasks [14, 34]. However, if erroneous/dirty data is detected after the model has been trained, the machine learning algorithm must be rerun to obtain the updated model parameters. This repetitive training can cause significant delays when large volumes of data are processed. One approach is to start each training phase by setting the initial model parameters to the ones generated by the previous training phase over the dirty data [34]. Our contribution is orthogonal to this approach, and updates the machine learning model parameters directly by reasoning over provenance rather than retraining from scratch.
Interpreting and understanding ML models. Fully understanding the behavior of ML models, especially deep neural network models, is difficult due to their complexity. Moreover, there are different perspectives on what we should understand. For example, one approach separates model components into “shape” functions, one for each feature, in generalized additive models, in particular for linear and logistic regression [6, 39]. Closest to our perspective is the idea of influence function [30] (similar problem is also mentioned in [44]), which originates from deletion diagnostics in statistics [9]. [30] estimates the effect of removing a single training sample on the already obtained model, without retraining the model. The influence function uses the Taylor expansion of the derivative of a customized objective function for the model parameter. The calculation (and thus the approximation of model parameter change) is only based on lower-order terms in Taylor expansion.
This can be seen as a method for incremental model update for just one sample deletion. In fact, we have observed that the method could be extended to deleting an arbitrary number of samples, which led us to compare it experimentally to our approach. The results (see Section 6) show that this approach leads to very inaccurate results when multiple training samples are deleted.
We give an overview of linear and logistic regression along with the gradient-based method for learning model parameters. Assume a training dataset (matrix representing the feature matrix while Y is an
1 vector representing the labels, i.e.:
For both linear and logistic regression we only focus on a common case: regularization. The objective functions of linear regression, binary logistic regression and multinomial logistic regression with
regularization are presented in Equations 2-4 respectively 1
where w is the vector of model parameters and larization rate. For simplicity, we denote
for multinomial logistic regression where q represents the number of possible classes. Typical learning methods for computing w are to apply gradient descent (GD) or its variant, stochastic gradient descent (SGD) or mini-batch stochastic gradient method (mb-SGD) [47] to minimize the objective function h(w) iteratively. GD, SGD and mb-SGD are the same in nature since mb-SGD can be regarded as a generalization of GD and SGD. They are therefore called Gradient-based method (GBM) for short, and hereafter we will only take mb-SGD as an example. Considering the similarities between binary logistic regression and multinomial logistic regression and the complexity of the computation related to the latter one, we will only present the formulas related to binary logistic regression below. All the theorems that hold for binary logistic regression can be also proven to be true for multinomial logistic regression.
At each iteration, mb-SGD updates the by using the average gradient of h(w) over a randomly selected mini-batch from the training dataset. Specifically, for linear regression and logistic regression, the rule for updating
under mb-SGD is presented below (Equations 5 and 6 respectively):
where is called the learning rate and
represents a mini-batch of B training samples. For SGD,
includes only one sample (B = 1), while for GD,
includes all the training samples (B = n).
In this section, we first discuss the annotation with provenance of the gradient-bases update rules in our approach. Next, we discuss for the non-linear operations in logistic regression the linearization that makes our provenance annotation framework usable. Finally, we give a rigorous theoretical analysis of the convergence of the iterative process with provenance-annotated update rules for both linear and logistic regression model and the similarity to the expected results after linearization for logistic regression models.
4.1 Provenance annotations for matrices
In the semiring framework [2, 20, 21] one begins by annotating input data with elements of a set T of provenance tokens. These annotations are then propagated through query operators as they combine according to two operations: “records alternative use of information, as in relational union or projection, and “
”, that records joint use of information, as in relational join. With these, the annotations become provenance polynomials whose indeterminates are tokens and with coefficients in N. For example, the monomial
the provenance of a result for which the data item annotated p was used twice together with the item annotated q used once. We denote the set of polynomials by N[T].
In the extension of the framework to matrix algebra [52], annotation formally becomes a multiplication of vectors with scalars as in linear algebra. The role of scalars is played by provenance polynomials and the role of vectors, of course, is played by matrices (generalizing their row vectors and the transposes of these).
Matrices annotated with provenance polynomials form a nice algebraic structure that extends matrix multiplication and addition. We denote multiplication with scalars by “” writing
for the matrix A annotated with the provenance polynomial p. For space reasons we cannot repeat here the technical development in [52], however, we mention a crucial algebraic property of annotated matrix multiplication, which also illustrates combining provenance in joint use:
We apply this framework to tracking input training samples through GBM’s in which the update involves only matrix multiplication and addition. Let the training dataset be (X, Y) where X is an feature matrix and Y is an
1 column vector of sample labels. For i = 1, . . . ,n, we annotate every sample
’th rows in X respectively Y) with a distinct provenance token
. Next, we decompose X and Y as algebraic expressions in terms of
and some matrices made up of the reals 0 and 1. These “helper” matrices are annotated with the provenance polynomial 1prov
only a term of degree zero which is the natural number 1) meaning “always available, no need to track”. We illustrate with the provenance-annotated X when n = 2:
When X is transposed, a similar decomposition applies in terms of the annotated column vectors . We also note that the algebra of annotated matrices follows the same laws as the usual matrix algebra. Consequently, we can perform in the algebra of provenance-annotated matrices the calculations involved in the gradient-based update rules. For illustration, a calculation involving X that without provenance takes the form
(where
are some real numbers) becomes with provenance annotations
And here is the provenance-annotated expression for the update rule of linear regression (i.e. Equation 5):
where represents the provenance-annotated expression for the vector
of model parameters while
resents a provenance-annotated expression for the number of samples in the min-batch
, for example, following the approach to aggregation in [2],
In the semiring framework there is no division operation so we used fractions with denominator in Equation 7 only for notational purposes. As we shall see immediately below, in incremental update
can be replaced with an integer.
As with the other applications of the semiring framework, deletion propagation is done by "zeroing-out" the deleted samples. That is, if sample i is deleted we set the corresponding provenance token 0prov
(has only a term of degree zero which is the natural number 0). The challenge, as detailed in the following section is how to do this efficiently throughout the gradient descent.
For the samples that remain we obtain (after we stop the iterations) a provenance-annotated expression that can be put in the form the provenance tokens and each
is a vector of contributions to the model parameters. To get the updated vector of model parameters we set each remaining provenance token to 1prov obtaining
. And, as promised, we notice that when all the provenance tokens are set to 0prov or 1prov the provenance expression
comes down to an integer. Denoting this integer by
and denoting the set of the indexes of the removed training samples by R, the provenance-annotated update rule for
4.2 Linearization for logistic regression
The model in [52] supports tracking provenance through matrix addition and multiplication. In order to apply it to GBM for logistic expression, we linearize, using piecewise linear interpolation, the non-linear operations in the corresponding update rules, i.e. Equation 6.
In Equation 6, the non-linear operations can be abstracted as , where the value of the product
is assigned to the variable x in Equation 6. Then f (x) can be approximated by applying 1-D piecewise linear interpolation [31]. So for each
and
can be approximated by
, where
are the linear coefficients produced by the linearizations, which depends on which sub-interval (defined by piecewise linear interpolation) the value of
locates and thus should be varied between different
and different
(see the associated superscript).
Throughout the paper, we will consider the case in which the variable x in f (x) is defined within an interval (a = 20) that is equally partitioned into 106 sub-intervals; for
, we assume that s(x) is a constant since when |x| > a, the value of f (x) is very close to its bound (0 or 1). We will show that the length of each sub-interval influences the approximation rate.
In terms of multinomial logistic regression, the non-linear operations in its update rule is the softmax function, which is a vector-valued function and thus requires piecewise linear interpolation in multiple dimensions, which can be achieved by using the interpolation method proposed in [51].
After the interpolation step over the update rules for binary logistic regression, Equation 6 is approximated as:
in which represents the model parameter after lineariza- tion at
iteration. By annotating each training sample
with provenance token
and by taking the similar derivation of Equation 8, after the removal of the subset of training samples the provenance expression becomes:
By setting all the in Equation 10 as 1prov, we can get the update rule for the updated model parameter
4.3 Convergence analysis for provenance-annotated iterations
One concern in using GBM is whether the model parameters ultimately converge. This has been extensively studied in the machine learning community [3, 29, 36, 48, 49]. In [3], convergence conditions have been provided for GD and SGD over strong convex objective functions. Those convergence conditions can exactly fit linear regression and logistic regression with L2-regularization because their objective functions are strong convex.
A similar concern occurs when GBM is coupled with provenance, i.e. whether the provenance expression tion 8 and
in Equation 10 converge in the case when the original model parameter
converges. We propose the following definition for the convergence of provenance-annotated expressions.
expressions. The expression when
iff every matrix
converges when
As mentioned before, we hope that the convergence of and
can be achieved when
can converge. The convergence conditions of
are presented below:
Lemma 1. Convergence conditions for general mb-SGD. [3] Given an objective function Lipschitz continuous and
strong convex once the learning rate
satisfies: 1)
is a constant across all the iterations (denoted by
converges when mb-SGD is used.
Unfortunately, our theoretical analysis shows that there is no convergence guarantee for and
under the convergence conditions from Lemma 1, i.e.:
Theorem 2. in Equation 8 and
in Equation 10 need not converge under the conditions in Lemma 1. 2
However, in Equation 8 and
in Equation 10 converge under the conditions in Lemma 1 with one more assumption about the provenance expression, i.e.:
Theorem 3. The expectation of in Equation 8 and of
in Equation 10, converge when
if we also assume that provenance polynomial multiplication is idempotent.
Intuitively speaking, the assumption of multiplication idempotence for provenance polynomials means that we do not track multiple joint uses of the same data sample, which is not problematic for deletion propagation.
4.4 Accuracy analysis for linearized logistic regression
The next question is whether the approximated model parameters after linearization of Equation 6 (i.e. in Equation 9) is close enough to the real model parameters from Equation 6. By following the approximation property of piecewise linear interpolation, we can prove that the distance between
is very small.
Theorem 4. is bounded by
where
is an arbitrarily small value representing the length of the longest sub-interval used in piecewise linear interpolations.
Furthermore, in terms of the updated model parameters for logistic regression, we also need to guarantee that the updated parameters are close to the real updated model parameters without linearization (denoted by
Recall that . Note that the linear coefficients
and
in Equation 11 are actually derived in the training phase where all samples exist (rather than in the model update phase), which implies that a larger difference between
should be expected. Surprisingly, we can prove that the distance between
and
is still small enough.
Theorem 5. is bounded by
is the number of the removed samples and
is defined in Theorem 4.
We now discuss how the ideas in the previous section are implemented in PrIU and PrIU-opt, for both linear and logistic regression. Along the way, time and space complexity analyses, as well as theorems that justify our approximation strategies used in PrIU and PrIU-opt, are provided.
5.1 PrIU: Linear regression
In Equation 8, by setting all the as 1prov, the expression
, in which the first term,
, can be regarded as provenance information and thus cached as an intermediate result for each mini-batch during the training phase for the original model parameter
. Thus we only need to compute the latter term during the incremental update phase.
can be computed in a similar way. In the end, Equation 8 is then rewritten as follows for the purpose of incremental updates:
Note that can be rewritten into matrix form, i.e.
is a matrix consisting of the removed samples in the mini-batch
. The associativity property of matrix multiplication can also be used to avoid expensive matrix-matrix multiplications (i.e.
by conducting more efficient matrix-vector multiplications instead (e.g. computing
first and then multiplying the result by
Suppose that samples are removed from each mini-batch on average, then the time complexity of updating the model parameters in each iteration using Equation 13 will be
(recall that the dimension of
trast, the time complexity for retraining from scratch (i.e. not caching
in Equation 13) will be
Of course, performance predictions based on asymptotic complexity give only very rough guidance, and we conduct experiments for realistic assessments. Still, the bounds above suggest, for example, that for small
and
incremental deletions with PrIU work better than retraining (and our experiments verify this, see Section 6).
Typically, however, a smaller mini-batch size B is used. To deal with the case in which m > B, we notice that the rank of the intermediate result, should be no more than B, thus smaller than m when B is smaller than m. This motivates us to reduce the dimension of the intermediate results using SVD, i.e.
where
is a diagonal matrix whose diagonal elements represent the singular values, while
and
are the left and right singular vectors. Suppose after SVD, we only keep the r largest singular values and the corresponding singular vectors where
is approximated by
(
represents the submatrix com- posed of the first r columns and
is a diagonal matrix composed of the first r eigenvalues in
). Thus Equation 13 is rewritten as:
Here we can cache the results of (denoted by
) and
for efficient updates, both of which have dimensions
Time complexity. The time complexity to update the model parameters using this approach is for each iteration since the computation time is dominated by the matrix-vector computation, e.g. the multiplications of
and
. This is more efficient than retraining from scratch, which has time complexity
. So the total complexity for PrIU is
, where
is the number of iterations in the training phase.
Space complexity. Using this approximation, at each iteration we only need to cache , which require space O(rm). So the total space complexity will be
for
iterations.
Theorem 6. Approximation ratio Under the convergence conditions for should be bounded by some constant C. Suppose
where
is a small value, then the change of model parameters caused by the approximation will be bounded by
This shows that with proper choice of r in the SVD approximation, the updated model parameters computed by PrIU or PrIU-opt should be still very close to the expected result. So in our implementations, r is chosen based on 0.01) such that the inequality in Theorem 6 is satisfied.
5.2 PrIU-opt: Optimizations for linear regression
When the feature space is small, additional optimizations can be used for linear regression. Note that according to [29], the model parameters derived by both SGD and mb-SGD will end up with statistically the same results as GD. This means that the update rule in Equation 5 and Equation 13 could be approximated by its alternative using GD, i.e.:
in which represent the removed samples while
represents the number of those samples. Let M and N denote
respectively. Then eigenvalue decomposition can be applied over
represents the eigenvalues of M). This is then plugged into Equation 15 and computed recursively, which results in the following formula:
This indicates that once the eigenvalues and eigenvectors of each M are given, we can derive by simply computing the product,
, and the sum of the product,
, on diagonal entries. The overhead of this is only
(recall that
represents the total iteration number), and thus we avoid the repetitive matrix multiplication operations through the for-loops. Also, observe that
can be regarded as a small change over
is small. Thus we can use the results on incremental updates over eigenvalues in [41], i.e. when the difference between the eigenvectors of
and that of M is negligible then the eigenvalues of
are estimated as:
Here, represents the approximated
eigenvalue of
. It indicates that we can apply eigenvalue decomposition over M offline before the model incremental update phase and use Equation 18 to get the updated eigenvalues online.
Time complexity. The time complexity for updating the model parameters is dominated by the computation of , which is followed by the computation over each
tion 17 does. These have time complexities
and
, respectively. So the total time complexity is
, which can be more efficient than the closed-form solution (see experiments in Section 6).
Space complexity. The method avoids caching the provenance information at each iteration, and only requires caching and all the eigenvalues
, which takes space
This shows that with small number of removed samples, the approximation ratio should be very small.
5.3 PrIU: Logistic regression
As the first step of the implementation of PrIU for logistic regression, non-linear operations are linearized using piecewise linear interpolation. Then, based on the analysis in Section 4, given the ids of the samples to be removed in dense datasets, R, Equation 11 is rewritten as follows:
where
Similar to linear regression, the intermediate results and
are cached and the dimension of
can be reduced by using SVD before the model update phase, which can happen offline. Suppose after SVD,
, in which
are two matrices with dimension
In the end, Equation 19 is modified as below for incremental model updates:
Time complexity. To apply Equation 20 in the model update phase, the computation of become the major overhead, which have time complexity O(rm) and
, respectively. Suppose there are
iterations in total, then the total time complexity is
. In comparison, the time complexity of retraining from scratch is
, where
represents the overhead of the non-linear operations. When
and
, we can therefore expect PrIU to be more efficient than retraining from scratch.
Space complexity analysis Through this approximation, we need to cache at each iteration, which requires
space in total. Plus,
extra space is necessary to cache the linear coefficients. So the total space complexity will be
the deviation caused by the SVD approximation will be bounded by , given the ratio
. So using Theorem 5,
is bounded by
This indicates that should be very close to
ilar to the discussion after Theorem 6).
Discussion Notice that for sparse datasets with large feature space, we can utilize the efficient sparse matrix operations by retraining from scratch. Also note that the intermediate result will be a sparse matrix for such datasets. However, after SVD, there is no guarantee that
are sparse matrices. Therefore, for sparse training datasets, we will simply use the linearized update rule, i.e. Equation 11 directly, without considering the strategies above.
5.4 PrIU-opt: Optimizations for logistic regression
Again, when the feature space is small additional optimizations are possible. In particular, we observe that for each sample i the change in the coefficients and
from one iteration to the next becomes smaller and smaller as
converges. This suggests that we can stop capturing new provenance information at some earlier iteration, call it , and continue with the same provenance until convergence. Suppose that for each sample i we approximate
by
after the iteration
. Therefore the matrices
and
will be approximated using the coefficients
and
and will remain the same for all iterations
allowing us to avoid their recomputation. In the experiments, we found that a rule of thumb that takes
to be 70% of the total number of iterations works well.
This has the same form as for linear regression, motivating us to use the same techniques from PrIU-opt for linear regression, i.e. conducting eigenvalue decomposition over , followed by incrementally updating the eigenvalues given the changes
, thus avoiding recomputations after the iteration
Space complexity. After the iteration , we only need to keep the eigenvectors of
, which requires
Including the space overhead for the first
iterations, the total space complexity is
logistic models with L2 regularization. Our solutions cannot handle L1 regularization since in this case the gradient of the objective function is not continuous, thus invalidating some of the error bound analysis above. How to handle L1 regularization will be our future work.
6.1 Experimental setup
Platform. We conduct extensive experiments in Python 3.6 and use PyTorch 1.3.0 [42] for the experiments for dense datasets and scipy 1.3.1 [27] for the experiments for sparse datasets. All experiments were conducted on a Linux server with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz and 64GB of main memory.
the Kaggle ECG Heartbeat Categorization Dataset7; (6) the CIFAR-10 dataset 8, which are referenced as SGEMM, Cov, HIGGS, RCV1, Heartbeat and cifar10 hereafter.
SGEMM has continuous label values, therefore we use it in experiments with linear regression while the rest of them have values that are appropriate for classification. Each dataset is partitioned into training (90% of the samples) and validation (10% of the samples) datasets, the latter used for measuring the accuracy of models trained from the former.
The characteristics of these datasets are listed in Table 1, which indicates that RCV1 and cifar10 have extremely large feature space (over 30k model parameters) while other datasets have much fewer parameters (Heartbeat has around 1000 while others have less than 500).
6.2 Experiment design
We conduct two sets of experiments, the first of which aims to evaluate the performance of PrIU and PrIU-opt with respect to the deletion of one subset of the training samples. We do this over different types of datasets (dense VS sparse, large feature space VS small) with varied configurations (how many samples to be removed, mini-batch size, iteration numbers etc.), and compare against retraining from scratch. The second set of experiments simulate the scenario where users repetitively remove different subsets of training samples.
In the first set of experiments we simulate the cleaning scenario. To specify the samples to be removed from the training datasets, we introduce dirty samples, which are a selected subset of samples from the original dataset T that are modified to incorrect values by rescaling. The resulting dataset is denoted , over which the initial model
is constructed. The dirty samples are then removed in the model update phase. The goal is to compare the robustness of PrIU, PrIU-opt and the influence function [30] method when dirty data exists. In the experiment we vary the number of erroneous samples generated. The ratio between the erroneous samples and the original training dataset is called the deletion rate, and we give it values ranging from 0.0001 (i.e. 0.01%) to 0.2 (i.e. 20%).
In the second set of the experiments, we simulate the scenario in which users debug or interpret models by removing different subsets of samples, necessitating repeated incremental model update operations. We assume that the datasets are very large; to simulate this, we create three synthetic datasets by concatenating 4 copies of HIGGS, 20 copies of Cov and 130 copies of Heartbeat such that the total number of training samples is around 40 million, 11 million and 11 million, respectively, which are denoted HIGGS (extended), Cov (extended) and Heartbeat (extended), respectively. In the experiments, ten different subsets are removed and for each of them the deletion rate is about 0.1% of randomly picked samples out of the full training set. The hyperparameters for this set of experiments are listed in Table 2.
Baseline. For both and
, we simulate what users (presumably unaware of errors) would do, and train an initial model
using the following standard method: Manually derive the formula for the gradient of the objective function and then program explicitly the GBM iterations. The erroneous or chosen samples are then removed from
. For linear regression (except for GBM), we also compare PrIU and PrIU-opt against close-form formula solutions for incremental updates [13, 22, 23, 40], denoted by Closed-form.
Incrementality. To update the model , the straightforward solution is to retrain from the scratch by using the same standard method as before but exclude the removed samples from each mini-batch. We denote this solution by BaseL. In contrast, our approach uses PrIU or PrIU-opt to incrementally update the model. The time taken by BaseL, PrIU or PrIU-opt to produce the updated model is reported in the experiments as the update time, and is compared over the two solutions: retraining with BaseL vs. incremental update with PrIU or PrIU-opt.
Note that our PrIU/ PrIU-opt approach uses provenance information collected from the whole training dataset. This phase is offline for the PrIU/ PrIU-opt algorithms and is not included in their reported running times. In practice, for the first set of experiments (cleaning of erroneous samples) provenance collection is done during the training of from
. For the second set of experiments (repeated deletions of subsets for debugging or interpretability) provenance collection is done during an initial training of
entire dataset
, which only needs to be done once even if many deletions of subsets are performed subsequently.
Since PrIU-opt is the optimized version for datasets with small feature space we only record the update time of PrIU over RCV1 and cifar10, which have very large feature spaces.
Accuracy. We compare the quality of the updated model obtained by BaseL and PrIU/PrIU-opt. The goal is to show that the improvement in update time is not achieved at the expense of accuracy. For experiments with linear regression, we use the mean squared error (MSE) over the validation datasets as a measure for accuracy. A lower MSE corresponds to higher accuracy over the validation set. For experiments with binary or multinomial logistic regression, we use the updated model to classify the samples in the validation datasets and report their validation accuracy.
Model comparison. We also compare the updated models structurally by comparing the vector of updated model parameters obtained via PrIU/PrIU-opt against the ones by using BaseL. This is done in two different ways: 1) Using distance, that is, the L2-norm of the difference between the two vectors, for both linear and logistic regression, and 2) Using similarity, that is, the cosine of the angle between the two vectors. The latter is only done for logistic regression since the angle is only relevant for classification techniques. For both linear regression and logistic regression, we also record the changes of the signs and magnitude of individual coordinate of the updated model parameters by PrIU and PrIU-opt compared to the ones obtained by BaseL.
Comparison with influence function. As indicated in Section 2, the influence function method in [30] can be extended to handle the removal of multiple training samples by us (details omitted). We denote the resulting method INFL and compare it against PrIU/PrIU-opt in the experiments. We predicted and verified experimentally that this approach produces models with poor validation accuracy since the derivation of INFL relies on the approximation of the Taylor expansion, which can be inaccurate. We also notice that the Taylor expansion used in INFL involves the computation of the Hessian matrix, which is very expensive for datasets with extremely large feature space. So we did not run INFL over RCV1 and cifar10 in the experiments; the comparison between PrIU/PrIU-opt and INFL over other datasets is enough to show the benefits of our approaches.
As discussed in Section 5, the performance of PrIU and PrIUopt is influenced by the mini-batch size, the number of iterations and the size of the feature space. To explore the effect of the first two parameters for logistic regression, three different combinations of mini-batch size and number of iterations are used over Cov, denoted Cov (small), Cov (large 1) and Cov (large 2) (see Table 2). Since the datasets used for logistic regression have different feature space sizes, the performance difference with respect to feature space size is also compared. Since there is only one dataset for linear regression, SGEMM, we extend this dataset by adding 1500 random features for each sample to determine the effect of feature space size. The extended version of SGEMM is denoted SGEMM (extended) (see Table 2). Other hyperparameters used in the experiments are shown in Table 2. Note that since erroneous samples exist in the training datasets for the first set of experiments, some values of the learning rate need to be very small to make sure that the convergence can be reached.
In the experiments, we answer the following questions:
(Q1) Do the optimizations used in PrIU-opt compared to PrIU lead to a significant improvement in update time without sacrificing accuracy when the number of features in the training set is small?
(Q2) Do PrIU and PrIU-opt afford significant gains in efficiency compared to BaseL?
(Q3) Are the efficiency gains provided by PrIU and PrIUopt achieved without sacrificing the accuracy of the updated model?
(Q4) Can we experimentally validate the theoretical analysis in Sections 4.4 and 5, i.e. that the updated model derived through the approximations in PrIU and PrIUopt is very close to the one obtained by BaseL?
(Q5) Does the influence function approach, INFL, provide a competitive alternative to PrIU and PrIU-opt?
(Q6) Can we experimentally show the effect of the hyperparameters, such as mini-batch size and iteration numbers over the performance gains of PrIU and PrIU-opt?
(Q7) Can we experimentally show the effect of the feature space size (i.e. the number of model parameters, which equals to the feature number times the number of classes for multi-nomial logistic regression)?
(Q8) What is the memory overhead of PrIU and PrIU-opt for caching the provenance information?
6.3 Experimental results
We report the results of our experiments in this subsection.
(Q1) We compare the update time of PrIU and PrIU-opt for linear regression using SGEMM (extended) in Figure 1b. The results show that the update time of PrIU-opt is significantly better than that of PrIU except when the deletion rate is approaching 20%. We also see from Table 4 that PrIU-opt and BaseL yield models that have exactly the same validation accuracy. Therefore, although PrIU-opt uses additional
Figure 1: Update time using linear regression
Figure 2: Update time using logistic regression over Cov and the hyperparameters from Table 2
Figure 3: Update time using logistic regression
Table 1: Summary of datasets
approximations for optimization, they do not hurt the predictive power of the updated models. This shows that the optimization strategies in Sections 5.2 and 5.4 are worth the design and implementation effort. Consequently, we will only compare PrIU-opt against other approaches except for cifar10 and RCV1 which have extremely large feature spaces.
(Q2) Figures 1a-1b compare the update time in BaseL and PrIU-opt using linear regression (ignore the INFL lines for the moment), while Figures 2-3 show the same results for logistic regression for single model update operation. Observe that for both linear and logistic regression, when the deletion rate is small (<0.01), PrIU-opt can achieve significant speed-up compared to BaseL: up to two orders of
Table 2: Summary of hyperparameters used in the ex- periments
magnitude for linear regression and up to around 23x for logistic regression (for Cov (large 1) and Cov (large 2) with low deletion rate). Even when the feature spaces are extremely
Table 3: Memory consumption summary (GB)
large, with deletion rate 0.1%, there is around a 2.6x speed-up for dense datasets (cifar10 in Figure 3c) and only 10% for sparse datasets (RCV1 in Figure 3c), respectively (similar speed-ups were observed for other small deletion rates). The former shows the effectiveness of the optimization strategies in PrIU over dense datasets with a large feature space while the latter is due to the fact that the optimization strategies for dense datasets were not applied over the sparse ones. Notice that for linear regression, PrIU-opt is always faster than Closed-form. Figure 4 shows the results of repetitive model updates; PrIU-opt achieves an order of magnitude speed-up for HIGGS (extended).
column) compares the quality of the models obtained by PrIU/PrIU-opt with that of the models obtained by BaseL. For these results we chose the highest deletion rate in the experiments, i.e. 20%. For all the experiments, the validation accuracy (MSE in the case of linear regression) of the updated models obtained by PrIU and PrIU-opt match exactly the accuracy of the ones obtained by BaseL. Combined with the answer to Q2, we can conclude that PrIU-opt speeds up the
(Q4) We investigate why PrIU-opt has the same validation accuracy as BaseL by measuring the distance and similarity between the updated models computed by PrIU-opt and BaseL. The results are presented in Table 4 (again, ignore the columns for INFL). The results indicate that the updated model parameters computed by PrIU-opt are very close to the ones obtained by BaseL since the cosine similarity is almost 1 (see the “similarity” column) while the L2-dist is very small (see the “distance” column). An even finer-grained analysis, comparing the signs and magnitude of each coordinate in the model parameters updated by PrIU-opt and BaseL shows that there is no sign flipping and only negligible magnitude changes for PrIU-opt compared to BaseL when the deletion rate is small. Even with a large deletion rate of 20% in HIGGS, only 2 out of 58 coordinates flip their signs with small magnitude change.
(Q5) The model update time of INFL is also included in Figures 2 and 3. Note that it can be up to one order of magnitude better than PrIU-opt, which is expected since using INFL to update the model parameters does not require an iterative computation. However, there is a significant drop in validation accuracy of the updated model derived by INFL compared to BaseL and PrIU-opt (see Table 4), which is due to the significantly higher L2-dist (see the “distance” column) and lower cosine similarity (see the “similarity” column) of its updated model compared to the model derived by BaseL. We conclude that PrIU and PrIU-opt produce much better models than INFL yet can still achieve comparable speed-ups.
(Q6) Effect of mini-batch size. The effect of mini-batch size is seen by comparing Cov (large 1) and Cov (small). One observation is that with larger mini-batch size, the maximal speed-up of PrIU-opt is around 23x, while with the smaller mini-batch size it is only about 6x, see Figures 2a and 2b This confirms the analysis in Section 5. In the second set of experiments, we used a small mini-batch size for Cov (1000) and Heartbeat (500), resulting in only 4.62x and 3.2x speed-ups by PrIU-opt, respectively (see Figure 4).
Figure 4: The execution time of repetitively removing 10 different subsets
Effect of number of iterations. A comparison of Cov (large 1) and Cov (large 2), which have the same mini-batch size but a different number of iterations, can be found in Figures 2b and 2c. We observe that no matter how many iterations the program runs for, at the same deletion rate PrIU-opt achieves a similar speed-up against BaseL. For example, we have up to around 23x speed-up for small deletion rates and smaller speed-up for higher deletion rates (note the difference in y-axis scale between Figures 2b and 2c). However, increasing the number of iterations increases the amount of provenance information cached for PrIU-opt, thus requiring more memory. As Table 3 indicates, since there are 6x iterations for Cov (large 2) compared to Cov (large 1), roughly 6x memory is needed, confirming the analysis in Section 5. However for Cov, with a large mini-batch size and 500 iterations, convergence is achieved and we do not observe a difference in validation accuracy between Cov (large 1) and Cov (large 2). Note that according to [10, 37, 43], the theoretical optimal number of passes for logistic regression
Table 4: Accuracy and similarity comparison between PrIU-opt and INFL with deletion rate 0.2
using mb-SGD (one pass equals to the total number of iterations divided by the number of iterations used for going through the full training set) is quite small. However, for Cov (large 2) the number of passes over the full training set is quite large (3000/(581012/1000060). Such a high memory usage should therefore not arise in practice.
(Q7) In terms of the update time for experiments over datasets with a comparable mini-batch size but with different feature space sizes (Heartbeat VS HIGGS), we notice that a larger number of model parameters leads to poorer performance by PrIU-opt (compare Figures 3a and 3b). This is also validated through a second set of experiments in which HIGGS (extended) achieves significant speed-up compared to Heartbeat (extended) (see Figure 4). This confirms the analysis in Section 5, where we show how the asymptotic execution time of PrIU and PrIU-opt depends on the number of the model parameters.
(Q8) Table 3 shows that in most cases, both PrIU and PrIUopt only consume no more than 5x memory compared to BaseL (ignore the number for Cov (large 2) since, as discussed earlier, it is a rare case in practice). However, with a large number of model parameters (like cifar10 and Heartbeat) there is over 10x memory consumption for PrIU and PrIUopt. How to decrease the memory usage for dense datasets with large feature space is left for future work.
Discussion. Extensive experiments using linear regression and logistic regression over the datasets above show the feasibility of our approach. PrIU and PrIU-opt can achieve up to two orders of magnitude speed-up for incrementally updating model parameters compared to the baseline, especially for large datasets with a small feature space. This is done without sacrificing the correctness of the results (measured by similarity to the updated model parameters by BaseL) and the prediction performance. The experiments also show that the optimizations used in PrIU-opt give significant performance gains compared to PrIU with only a small loss of accuracy. We observe that INFL is not a good solution because of the poor quality of models produced when more than one sample is removed.
Limitations. Our experiments also show the limitations of our solutions. They concern the memory footprint when the feature space or the number of iterations is large (anticipated by several analyses in Section 5) and the marginal speed-up for large sparse datasets (See Section 5.3). We shall endeavor to approach these limitations in future work.
In this paper, we build a connection between data provenance and incremental machine learning model updates, which is useful in many machine learning and data science applications. Building on an extension of the provenance semiring framework [21] to include basic linear algebra operations [52], we capture provenance in the training phase of linear regression and (binary and multinomial) logistic regression and address non-linear operations in logistic regression using piecewise linear interpolation. We prove that linearization does not harm convergence of the updated parameters and similarity to the expected results. Based on these theoretical results, we construct solutions, PrIU and PrIU-opt, which are optimized to reduce the time and space overhead. The benefits of our solutions are experimentally verified through extensive evaluations over various datasets. Looking forward, we believe that these solutions for simpler machine learning models are likely to extend to generalized additive models [24] and they also pave the way toward solutions for more complicated machine learning models such as deep neural networks.
This material is based upon work that is in part supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0047. Partial support was provided by NSF Awards 1547360 and 1733794. Tannen’s work at the National University of Singapore was supported in part by the Kwan Im Thong Hood Cho Temple/Avalokiteśvara.
[1] Yael Amsterdamer, Susan B Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, and Val Tannen. 2011. Putting lipstick on pig: Enabling database-style workflow provenance. Proceedings of the VLDB Endowment 5, 4 (2011), 346–357.
[2] Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for aggregate queries. In Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 153–164.
[3] Léon Bottou, Frank E Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. Siam Review 60, 2 (2018), 223–311.
[4] Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. 2001. Why and where: A characterization of data provenance. In International conference on database theory. Springer, 316–330.
[5] Peter Buneman and Wang-Chiew Tan. 2018. Data Provenance: What next? ACM SIGMOD Record 47, 3 (2018), 5–16.
[6] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015. 1721– 1730.
[7] James Cheney, Laura Chiticariu, and Wang Chiew Tan. 2009. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 1, 4 (2009), 379–474.
[8] Xu Chu, Ihab F Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data. ACM, 2201–2206.
[9] R Dennis Cook. 1977. Detection of influential observation in linear regression. Technometrics 19, 1 (1977), 15–18.
[10] John Darzentas. 1984. Problem complexity and method efficiency in optimization. Journal of the Operational Research Society 35, 5 (1984), 455–455.
[11] Shagnik Das. [n.d.]. A brief note on estimates of binomial coefficients.
[12] Tamraparni Dasu and Theodore Johnson. 2003. Exploratory data mining and data cleaning. Vol. 479. John Wiley & Sons.
[13] Amol Deshpande and Samuel Madden. 2006. MauveDB: supporting model-based user views in database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 73–84.
[14] Mohamad Dolatshah, Mathew Teoh, Jiannan Wang, and Jian Pei. 2018. Cleaning crowdsourced labels using oracles for statistical classification. Proceedings of the VLDB Endowment 12, 4 (2018), 376–389.
[15] Finale Doshi-Velez and Been Kim. 2017. A roadmap for a rigorous science of interpretability. arXiv preprint arXiv:1702.08608 150 (2017).
[16] Tommy Ellkvist, David Koop, Erik W Anderson, Juliana Freire, and Cláudio Silva. 2008. Using provenance to support real-time collaborative design of workflows. In International Provenance and Annotation Workshop. Springer, 266–279.
[17] Wenfei Fan and Floris Geerts. 2012. Foundations of Data Quality Management. Morgan & Claypool Publishers.
[18] Todd J Green, Grigoris Karvounarakis, Zachary G Ives, and Val Tannen. 2007. Update exchange with mappings and provenance. In Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 675–686.
[19] Todd J Green, Grigoris Karvounarakis, Zachary G Ives, and Val Tannen. 2010. Provenance in ORCHESTRA. (2010).
[20] Todd J Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 31–40.
[21] Todd J. Green and Val Tannen. 2017. The Semiring Framework for Database Provenance. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017. 93–99.
[22] Priyank Gupta, Nick Koudas, Europa Shang, Ryan Johnson, and Calisto Zuzarte. 2015. Processing analytical workloads incrementally. arXiv preprint arXiv:1509.05066 (2015).
[23] Sona Hasani, Saravanan Thirumuruganathan, Abolfazl Asudeh, Nick Koudas, and Gautam Das. 2018. Efficient construction of approximate ad-hoc ML models through materialization and reuse. Proceedings of the VLDB Endowment 11, 11 (2018), 1468–1481.
[24] Trevor Hastie and Robert Tibshirani. 1986. Generalized additive models. Statist. Sci. 1, 3 (1986), 297–318.
[25] Alireza Heidari, Joshua McGrath, Ihab F Ilyas, and Theodoros Rekatsinas. 2019. HoloDetect: Few-Shot Learning for Error Detection. arXiv preprint arXiv:1904.02285 (2019).
[26] Zachary G Ives, Todd J Green, Grigoris Karvounarakis, Nicholas E Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, and Fernando Pereira. 2008. The ORCHESTRA collaborative data sharing system. ACM Sigmod Record 37, 3 (2008), 26–32.
[27] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001–. SciPy: Open source scientific tools for Python. http://www.scipy.org/
[28] Daniel Kang, Deepti Raghavan, Peter Bailis, and Matei Zaharia. 2018. Model Assertions for Debugging Machine Learning. https://www-cs.stanford.edu/~matei/papers/2018/mlsys_model_ assertions.pdf Preprint.
[29] Hamed Karimi, Julie Nutini, and Mark Schmidt. 2016. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 795–811.
[30] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1885–1894.
[31] Rainer Kress. 1998. Interpolation. Springer New York, New York, NY, 151–188. https://doi.org/10.1007/978-1-4612-0599-9_8
[32] Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017).
[33] Sanjay Krishnan, Daniel Haas, Michael J Franklin, and Eugene Wu. 2016. Towards reliable interactive data cleaning: a user survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics. ACM, 9.
[34] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016), 948–959.
[35] Sanjay Krishnan and Eugene Wu. 2017. Palm: Machine learning explanations for iterative debugging. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. ACM, 4.
[36] Raunak Kumar and Mark Schmidt. 2017. Convergence rate of expectation-maximization. In 10th NIPS Workshop on Optimization for Machine Learning.
[37] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Neural networks: Tricks of the trade. Springer, 9–48.
[38] Zachary Lipton. 2016. The Mythos of Model Interpretability. CoRR abs/1606.03490 (2016).
[39] Yin Lou, Rich Caruana, and Johannes Gehrke. 2012. Intelligible models for classification and regression. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12-16, 2012. 150–158.
[40] Milos Nikolic, Mohammed Elseidy, and Christoph Koch. 2014. LINVIEW: incremental view maintenance for complex analytical queries. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 253–264.
[41] Huazhong Ning, Wei Xu, Yun Chi, Yihong Gong, and Thomas S Huang. 2010. Incremental spectral clustering by efficiently updating the eigensystem. Pattern Recognition 43, 1 (2010), 113–127.
[42] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[43] Boris T Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 4 (1992), 838–855.
[44] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1723–1726.
[45] Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810 (2018).
[46] Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3–13.
[47] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951), 400–407.
[48] Mark Schmidt. 2014. Convergence rate of stochastic gradient with constant step size. (2014).
[49] Jennifer She and Mark Schmidt. 2017. Linear convergence and support vector identifiation of sequential minimal optimization. In 10th NIPS Workshop on Optimization for Machine Learning. 5.
[50] W Sun and Lars Erik Sjöberg. 2001. Convergence and optimal truncation of binomial expansions used in isostatic compensations and terrain corrections. Journal of Geodesy 74, 9 (2001), 627–636.
[51] Alan Weiser and Sergio E Zarantonello. 1988. A note on piecewise linear and multilinear table interpolation in many dimensions. Math. Comp. 50, 181 (1988), 189–196.
[52] Zhepeng Yan, Val Tannen, and Zachary G Ives. 2016. Fine-grained Provenance for Linear Algebra Operators.. In TaPP.
[53] Wenchao Zhou, Micah Sherr, Tao Tao, Xiaozhou Li, Boon Thau Loo, and Yun Mao. 2010. Efficient querying and maintenance of network provenance at internet-scale. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 615–626.
A.1.1 Notations for objective functions, gradients and update rule. The objective functions for linear regression, binary logistic regression and multinomial logistic regression are shown as below (They are Equation (2)-(4) in the paper):
Note that h(w) can be rewritten as . For example for Equation (3),
For linear regression and logistic regression, the rule for updating
under mb-SGD is presented below (They are Equation (5)-(6) in the paper and we use
to denote the average gradients evaluated over the mini-batch at the
iteration):
Note that in Equation (25), the non-linear part can be abstracted as . So this formula can be also represented as:
So
Also we can explicitly evaluate
in which should be a semi-definite matrix since
is a monotonically decreasing function and thus
should be negative for any x.
A.1.2 Notations for the linearized update rule. After the interpolation step over the update rules for binary logistic regression, Equation (25) can be approximated as (It is Equation (9) in the paper):
which can be also represented as:
in which Suppose after removing certain subset (the number of those samples is
and the corresponding indices are R), Equation (29) becomes (It is Equation (11) in the paper):
For the linearized version of the update rule of logistic regression in Equation (29) and the update rule in Equation (30), we
represent
where and
can be considered as pseudo-derivative in Equation (29). So Equation (29) and Equation(30) can be rewritten as:
In contrast, by computing the model parameter from the scratch for logistic regression after removing the same set of training samples, the update rule is (It is Equation (12) in the paper):
which aims at minimizing the following objective function:
in which R represents the ids of the samples that are removed and represents the number of the removed samples and
Similarly after removing certain subset, the update rule for linear regression model is (It is Equation (13) in the paper):
The provenance expression for the model parameters of linear regression model and logistic regression model after removing subset of training samples are (They are Equation (8) and Equation (10) in the paper):
There are some useful properties related to matrix theory, matrix norm, real analysis and SGD convergence, which will be used in the follow-up proof.
Lemma 2 (SGD convergence, [3]). (Full version of Lemma (1) in the paper) Suppose that the stochastic gradient estimates are correlated with the true gradient, and bounded in the following way. There exist two scalars such that for arbitrary
the following two inequalities hold:
If the gradient estimates are unbiased, then . Moreover,
is the minibatch size, because
is the variance of the stochastic gradient.
So the convergence condition for fixed step size becomes suffices to ensure convergence.
So in what follows, we will simply consider the case where the learning rate is a constant across all the iterations as Lemma 2 indicates.
Lemma 3. For a matrix A, its norm equals to its largest singular value and the maximal eigenvalue of matrix
, i.e.:
If A is a semi-definite matrix, its eigenvalue is the same as its singular value, then the equation above can be rewritten as:
is invertible, then the following statements are equivalent:
Lemma 6. Cauchy schwarz inequality For any two matrix A and B, their norm should satisfy the Cauchy schwarz inequality, i.e.: represents any matrix norm
Lemma 7. For any three Hermitian matrices, M, N, P satisfying M = N + P, the eigenvalues of M is:
the eigenvalues of
and the eigenvalues of
the following inequalities hold:
The following lemma requires the definition of Lipschitz-continuity and Strong-convexity, which are provided below:
Lipschitz-continuous A function f (x) is Lipschitz-continuous (continuous) if there exists a constant L such that the
Another form of Equation (47) is:
Other equivalent forms of Equation (49) are:
Then there is a useful lemma about strong convexity, i.e:
Lemma 9. Piecewise linear interpolation In Piecewise linear interpolation [31], we assume that the function to be approximated is a continuous function f (x) where . Piecewise linear interpolation starts by picking up a series of breaking points,
such that
and then constructs a linear interpolant s(x) over each interval
as follows:
Lemma 10. Expectation of the number of the removed samples Because of the randomness from mb-SGD, the samples can be viewed as uniformly distributed within all n training samples, which can be considered as a 0
1 Bernoulli distribution with probability
. In other words, we can define a random variable
for each sample, which is 1 with probability
and
and
In mb-SGD, a typical assumption is used for the convergence analysis of the model parameter in Equation (24)-(25) and the update rules for other general models, i.e.:
over the all the samples, i.e.:
In what follows, our analysis is based on the following assumptions:
Assumption 1. every Lipschitz continuous. Since
norm regularization term, then we also know that
strong convex.
Theorem 10. (It is Theorem 2 in the paper) in Equation (40) and
in Equation (41) need not converge under the conditions in Lemma 2.
Proof. Let us take linear regression as an example. Note that we can explicitly evaluate the second order derivative of h(w) for linear regression, i.e. , then according to Assumption 1,
should satisfy the following inequality:
In order to prove Theorem 10, we need to show that there exists a case where cannot converge under the conditions in Lemma 2. This is achieved by considering gradient descent (GD) without excluding any original training samples, i.e.
, every
in Equation (40) and every
includes all n samples in Equations (24) and 40. We can then apply the update rule in Equations (24) and (40) recursively, which ends up with:
According to Assumption 1, the following inequality should be satisfied:
which implies that
Then by applying Equation (7), Equation (61) also leads to:
Then we can expand the first term in the right-hand side of Equation (58), the tensor product with provenance monomial should be
. According to the convergence conditions in Lemma 2,
and thus
. According to [11, 50], when
,
should be very close to 2
and
thus when (note that
can be any value between 0 and 1 according to Equation (62)),
, which means that the tensor product with provenance monomial
cannot converge and thus
cannot converge.
Theorem 11. (It is Theorem 3 in the paper) The expectation of in Equation (40) and of
in Equation (41), converge when
if we also assume that provenance polynomial multiplication is idempotent.
Convergence proof for linear regression We simply need to consider whether converges (sup-pose there are
provenance tokens in total that are set as 0, which corresponds to the deletion of
samples), which equals to the update rule in Equation (39) and leads to a new objective function without the
removed samples (denoted by
In what follows, we will only prove the convergence of binary logistic regression, i.e. the convergence of , which is the same as proving the convergence of
. The convergence of linear regression and multi-nomial logistic regression can be proven in the similar way.
According to Lemma 9, the following equation holds:
for all
(rather than
or
since
and
are evaluated when w is
). Since
, then
. Then by bringing in the definition of
, we know that:
By using the fact that , the following formula also holds:
But note that since every is a negative value, then
is a semi-positive definite matrix and thus:
Then we bound
Similarly we know that the following inequality holds for any w:
Then the convergence of can be derived below (in the case of constant
according to Lemma 2):
inequality and the definition of , the formula above is bounded as:
By using the result from Equation (68), we get:
Theorem 12. (This is Theorem 4 in the paper) is bounded by
is an arbitrarily small value representing the length of the longest sub-interval used in piecewise linear interpolations.
Proof. By subtracting Equation (25) by Equation (35) and taking the matrix norm, we get:
Then by using triangle inequality and the bound from Equation (68), the formula above is further bounded as:
Then by using the result from Equation (63), the formula above is rewritten as:
Then by applying the formula above recursively, we have:
According to Assumption 2 and Equation (63) and by using the triangle inequality and the Theorem above, we have:
Then by using the cauchy mean value theorem over , the formula above is bounded as:
By rewriting the formula above and using the upper bound on
Then by using the result from Equation (69) and Equation (54), the formula above is bounded as:
Theorem 14. (It is Theorem 5 in the paper) is bounded by
number of the removed samples and
is defined in Theorem 12
Proof. By using the definition of and subtracting the former one from the latter one, we have:
Then by using the result from Equation (68), the formula above is bounded as:
in which we bound
The last step uses the result from Equation (63). Then by using the Cauchy mean value theorem on we know that:
Then by adding and subtracting in the first term and using the fact from Equation (64) and Assumption 3, the formula above is bounded as:
Then by using the result from Theorem 13, the formula above is bounded as:
which is then plugged into Equation (72), we have:
which is then used recursively. Then we have:
Theorem 15. Approximation ratio (It is Theorem 6 in the paper) Under the convergence conditions for bounded by some constant C. Suppose
where
is a small value, then the change of model parameters caused by the approximation will be bounded by
Proof. The approximate update rule for linear regression by using SVD after removing subsets of training samples is:
By comparing Equation (39) against this approximate update rule, the only difference is in Equation (73) and
in Equation (39). Then according to the condition in this theorem,
. So by subtracting Equation (39) by Equation (73), the result becomes:
Then we evaluate the expectation between both sides, which ends up with:
in which equals to the hessian matrix for the object function over the remaining samples after removing subsets of samples, which should be still
strong convex and
lipschitz continuous. It indicates that:
Since according to Lemma 2,
1 So we can compute Equation (75) recursively and thus the following inequality holds:
Theorem 16. (Approximation ratio) (It is Theorem 7 in the paper) The approximation of PrIU-opt over the model parameters is bounded by
Proof. The update rule through gradient descent for linear regression model is as below:
while the approximated update rule through the approximations by incremental updates over eigenvalue is:
According to [41], the difference between
and
is bounded by
. So by subtracting Equation (79) by Equation (80), the results become:
Through the similar analysis to Theorem 15, the above formula is computed recursively, which ends up with:
Theorem 17. (Approximation ratio) (It is Theorem 8 in the paper) Similar to Theorem 15, the deviation caused by the SVD
approximation will be bounded by , given the ratio
. So using Theorem 14,
Proof. Let’s assume that the incremental updated model parameter without SVD approximation is . According to the results in Theorem 14, the
. After the SVD approximation, similar analysis to Theorem 15 can be done such that
Theorem 18. (Approximation ratio) (It is Theorem 9 in the paper) Suppose that after iteration the gradient of the objective function is smaller than
, then the approximations of PrIU-opt can lead to deviations of the model parameters bounded by
. By combining the analysis in Theorem 14,
is bounded by
Proof. Let’s assume that after iteration, the incremental updated model parameter without only linearization approximation is
. Then
based on Theorem 14. Also let’s assume that after
iteration, the incremental updated model parameter with both linearization approximation and SVD approximation is