Sampling for Deep Learning Model Diagnosis (Technical Report)

2020·Arxiv

Abstract

Deep learning (DL) models have achieved paradigm-changing performance in many fields with high dimensional data, such as images, audio, and text. However, the black-box nature of deep neural networks is not just a barrier to adoption in applications such as medical diagnosis, where interpretability is essential, but also an impediment to diagnosing underperforming models. The task of diagnosing or explaining DL models requires the computation of additional artifacts, such as activation values and gradients. These artifacts are large in volume, and their computation, storage, and querying raise significant data management challenges.

In this paper, we articulate DL diagnosis as a data management problem, and we propose a general, yet representative, set of queries to evaluate systems that strive to support this new workload. We further develop a novel data sampling technique that produces approximate but accurate results for these model debugging queries. Our sampling technique utilizes the lower dimensional representation learned by the DL model and focuses on model decision boundaries for the data in this lower dimensional space. We evaluate our techniques on one standard computer vision data set and one scientific data set and demonstrate that our sampling technique outperforms a variety of state-of-the-art alternatives in terms of query accuracy.

I. INTRODUCTION

Deep learning (DL) models have enabled unprecedented breakthroughs in developing artificial intelligence systems for analyzing high-dimensional data, such as text, audio, and images. Building such models is a data intensive task. To build an effective model, a machine learning (ML) practitioner needs to proceed in an iterative fashion, building and tuning dozens of models before selecting one. While naive selection of the best model could be based on statistical measures such as accuracy or F1 score, examining what the model is learning and why it is making mistakes requires access to artifacts, such as model activations and gradients. Activation values, or activations, are learned representations of input data. Gradients are partial derivatives of the target output (e.g., the true label of the input data) with respect to the input data. At a high level, activations and gradients are high dimensional vectors with sizes that depend on input data dimensionality and DL model architecture. While activations depict what the deep learning model sees, gradients depict areas of high model sensitivity. The naive solution of pre-computing and storing all artifacts required for model diagnosis has a storage cost that scales as the product of the size of the input data and the number of parameters of the deep learning model. For instance, consider a VGG-16 [56] model trained on CIFAR-10 [9]. CIFAR-10 is a moderately sized data set with 60k images of 32x32x3 pixels each; VGG-16 [56] is a deep learning model with 22 layers and 33,638,218 learned parameters. Storing activations for ten experiments of training a VGG-16 model on CIFAR-10 data results in 350GB of data. Although the total number of artifacts for small data sets and models is manageable, an overhead that is three orders of magnitude larger than the input data per model is not scalable. This makes it difficult to efficiently perform diagnosis tasks, often preventing interactive diagnosis. Thus, with hundreds of gigabytes of artifacts per model, building, diagnosing, and selecting a DL model becomes a large-scale data management challenge.

Previous attempts at solving this problem either pre-generated all data required to provide interactive query times [23], [24], [31], [48] or utilized a variety of storage optimization techniques to manage the storage footprint [34], [55]. Both approaches require pre-generated artifacts. Several visualization tools pre-generate some of the aggregates and severely limit the type of queries that can be posed, while others simply do the latter. Systems with storage optimizations [55] reduce the storage required for this data by utilizing techniques such as de-duplication and quantization.

Sampling is a fast and flexible database technique for approximate query processing. It works well in high dimensions [1], [3], [7], [20] and is a potential candidate for this workload. However, many queries posed for model diagnosis depend on retrieving the top-k maximally activated neuron(s) (see Section III for workload characterization). Processing these queries from samples is difficult. The natural estimator for a top-k query over the sample is the top-k on the sample; however, to un-bias this estimate we need access to the full distribution of frequencies in the un-sampled data set, which is the set of activations for the entire data set over the entire model, a volume far too vast to generate or store.

The key insight we discovered for creating a sample that can be utilized for DL model diagnosis is that the DL model, along with its other objectives such as classification, learns a lower dimensional representation of the data. DL training transforms the input data, creating a new representation with each layer. Therefore, to diagnose the model, we leverage this lower dimensional representation of data rather than store and analyze the distribution of activation values to create a sample (see Section IV for details). The sampling technique that we develop specifically targets model diagnosis queries, which include top-k queries as well as average values, and provides more accurate answers than uniform sampling, stratified sampling, and other sampling techniques from the literature. Our technique selects a sample from the input (test and training) data set so that artifacts need to be generated only for the sample. This approach not only reduces the storage footprint and speeds up queries since we store less data, but it also speeds up the overall diagnosis process by saving the time it would otherwise take to generate all artifacts for the entire data set.

In summary, our contributions include:

• Characterizing requirements of DL model diagnosis by studying debugging queries in the literature. We further develop a simple benchmark for this novel workload by generalizing individual queries used in model debugging papers into query sets that cover families of queries (Section III).

• Presenting a new technique for creating samples for DL model diagnosis (Section IV).

• Evaluating our approach on two data sets and demonstrating its performance compared with a variety of state-of-the-art alternatives. Our sampling technique can be used to debug any deep learning model where a lower dimensional representation of the input data is learned in a supervised, semi-supervised or unsupervised manner.

II. PRELIMINARIES

We now summarize current approaches for DL model diagnosis and their associated data management challenges, which we address in this paper.

A DL model takes as input a vector $x$ and produces as output another vector $S(x) = [S_1(x), \ldots, S_C(x)]$, where $C$ is the total number of output neurons. DL models are constructed in layers; intermediate layers are called hidden layers, and each hidden layer of the model is vector-valued. The dimensionality of these hidden layers determines the width of the model, and the number of hidden layers determines its depth. These layers often perform different operations, such as convolution, pooling, and dropout, and are named accordingly. When the model is evaluated over an input data instance, such as an image, it produces a value for each of the $C$ neurons. The raw values thus produced are activations, and derivatives of these values with respect to a target, such as a class label, are gradients. Diagnosis of DL models relies on these artifacts. The ML community has a variety of techniques to diagnose these models, which we discuss below.

1) Visualization: Manual visual inspection is a popular diagnosis technique for DL models [17], [23]–[25], [31]. Standalone tools for visual inspection of DL models built on image data (Cnnvis [31]) and text data (Activis [23], LSTMvis [48]) have been proposed. Some visualization capability is also integrated with deep learning libraries (e.g., Tensorboard [51]). These tools provide static and interactive visualizations of DL model activations and, on occasion, gradients. They let ML practitioners view activation or gradient patterns for various layers as well as view aggregates (e.g., average activation) over sets of input data instances belonging to each class, which may be classified correctly or incorrectly. This lets ML practitioners identify specific neuron pattern anomalies and neuron groups and data instances that require further investigation. Challenge: The size of the artifacts required for these visualizations depends on the size of the input data and the complexity of the model. It can easily be 3 orders of magnitude larger than the input data set, as shown in Table I. To support interactive visualization for arbitrary queries, these artifacts must be pre-computed since real-time computation is too slow to be interactive. To deal with the associated data explosion, tools such as Activis [23] limit the number of layers that can be visualized in the tool.

TABLE I: Data size and model sizes for standard ML data set (MNIST) and scientific image data set (Galaxy Zoo2).

2) Examining learned representation: A DL model simultaneously learns a lower level representation of the data and a classifier (in the case of supervised learning). The learned representation (activations of neurons over an input data set) is used for a variety of goals, such as understanding how a model discriminates between different classes, comparing different model architectures or hyper-parameters, and examining how learning progresses over time by analyzing representations at various checkpoints in the learning process [27], [33], [37]. Challenge: These analyses require activations for the entire model(s) over the entire input data set. If the training process is being examined, the activations for multiple checkpoints must be generated and stored. As above, the required artifacts, especially if diagnosing multiple models or multiple checkpoints, can result in a data explosion.

3) Feature visualization and saliency analysis: Feature visualization techniques answer questions about what a DL model, or parts thereof, is looking for by generating examples from the learned model [38]. Feature visualization uses derivatives to iteratively modify an input, such as random noise, with the goal of finding the input that maximally activates a particular neuron(s). Saliency analysis identifies parts of the input that have the largest effect on the output. This consists of a number of approaches that propagate gradients through the model to identify areas of highest activation and highest sensitivity [32], [44], [45], [49], [61]. Challenge: Even simple DL models consist of hundreds of thousands of neurons (e.g., Table I). Finding the appropriate set of neurons to visualize can be beyond the powers of human cognition. Saliency analysis works on a per-input-data-item basis; ML practitioners would need specific input data points, such as images, for these methods. DL data sets consist of tens of thousands of instances, so picking appropriate data instances from these large data sets is imprecise, especially if the data set is new, large, and contains unexplored scientific data.

4) Statistical analysis: Many data sets are annotated. Language data sets are annotated with parts of speech or linguistic features, and image data sets are annotated with object information. For instance, the Broden data set [8] has pixel-level annotations that indicate the object to which each pixel belongs. These annotations are used to pose hypotheses and conduct statistical analyses between neuron activations and annotations to evaluate these hypotheses [43]. Challenge: Statistical analyses require such annotations to formulate hypotheses. The two data sets we utilize do not have any annotations. Indeed, most scientific image data sets do not, which makes statistical analysis impossible.

III. WORKLOAD CHARACTERIZATION

We now develop a summary workload that captures the requirements of a large set of DL model diagnosis queries. Model diagnosis techniques, such as visualization and examination of learned representation, bring the number of neurons and data instances to be examined down to a smaller and manageable number. This section focuses on queries from these two categories, as these queries help make downstream analyses, such as feature visualization, attribution, and saliency analysis, tractable over massive data sets. We do not include queries from statistical analysis, as it requires annotations on the data set.

An ML practitioner typically starts model diagnosis with techniques utilized by visualization tools from Section II. They create data subsets that are incorrectly classified, generate aggregates (such as average activations and top-k highest activations), and compare them to similar aggregates for data instances that were correctly classified for each class. They start the analysis from sets [23], [31], such as all incorrectly labeled instances of a class, rather than specific instances to find such patterns. This analysis lets them identify important patterns for the various subsets and reduce to a manageable number both data instances and parts of the model (layers and neurons) to be examined [23], [31]. They then start correlating input data and parts of the model, conducting attribution and saliency analyses. Similarly, ML practitioners comparing two different models trained on the same data leverage techniques listed under examining learned representation in Section II. For instance, they generate neuron activation values for each data item for both models for each layer. They compare these to the logits for each class learned by the respective model to decipher each model’s rate of learning, to understand the impact of additional layers and their sizes, and thus to diagnose how complex the model must be to complete this task.

Table II lists representative queries from the literature used to diagnose DL models. We make two observations from this list of queries. First, DL model diagnosis queries require one of three quantities: the top-k maximally activated neurons, the distribution of maximally activated neurons, or the average activation values. The focus on maximal and top-k values as opposed to minimal values is due to the activation functions [2] used in DL models. Without such functions, DL models are just complicated linear regression models. ReLU is the most commonly used activation function today [41]. It removes negative values and propagates positive values. Mathematically, ReLU is defined as max(0, val). Therefore, in the DL literature, sample queries often focus on average or maximal values but not minimum values. Thus, to characterize an ML diagnosis workload, instead of focusing on all aggregates we focus on three aggregates: (1) top-k maximally activated neurons, (2) average activation values for neurons, and (3) distribution of maximally activated neurons.
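As an illustration of the three aggregates, the following minimal Python sketch computes them over a small, hypothetical activation matrix (rows are data items, columns are neurons). The function names and the choice to rank neurons by their mean activation are assumptions made for this example; the queries in the literature may rank neurons differently.

```python
from collections import Counter

def relu(val):
    """ReLU as defined in the text: max(0, val)."""
    return max(0.0, val)

def average_activations(acts):
    """Mean activation per neuron over all data items (rows)."""
    n_items = len(acts)
    return [sum(col) / n_items for col in zip(*acts)]

def top_k_neurons(acts, k):
    """Indices of the k neurons with the highest mean activation (assumed ranking)."""
    means = average_activations(acts)
    return sorted(range(len(means)), key=lambda i: means[i], reverse=True)[:k]

def max_neuron_distribution(acts):
    """Per neuron, the fraction of data items on which it is the most activated."""
    counts = Counter(max(range(len(row)), key=lambda i: row[i]) for row in acts)
    n_items = len(acts)
    return [counts.get(i, 0) / n_items for i in range(len(acts[0]))]

# Hypothetical pre-activation values: 3 data items (rows) x 4 neurons (columns).
pre = [[-1.0, 2.0, 0.5, 3.0],
       [0.0, 4.0, -2.0, 1.0],
       [-3.0, 1.0, 0.5, 5.0]]
acts = [[relu(v) for v in row] for row in pre]

avg = average_activations(acts)        # aggregate (2)
top2 = top_k_neurons(acts, 2)          # aggregate (1)
dist = max_neuron_distribution(acts)   # aggregate (3)
```

Because ReLU clips negative values to zero, the first neuron in this example has zero average activation and never appears among the maximally activated neurons, which is why minimum-value queries carry little information.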

Second, each query in Table II is part of a family of queries. For instance, the answer to Q1 requires average values of all neurons for a specific layer (conv2) for all classes. A family of queries for Q1 would include average values of neurons for any layer and any class where data instances could be correctly or incorrectly classified. We can see that queries Q3 and Q1 belong to the same family. Similarly, Q2 belongs to a family of queries that require the top-k neurons across classes, layers, and incorrect and correct classifications. Thus, to characterize this workload we utilize the entire family of queries. We call these families of queries query sets.

We now introduce some notation and define query sets formally.

A DL model $M$ is a vector of $N$ units or neurons. $M$ is learned and tested over data $D$. Artifacts, such as activations $A$, are vectors of the same dimensionality as $M$, computed over data $D$. $A_{i,d}$ are the activation value(s) of neuron(s) $i$, where $i \subseteq M$, on data item(s) $d$, where $d \subseteq D$. A query set $S$ computes a measure $f$, such as the mean, top-K, count, or count of maximum values, for $A_{i,d}$. Given the preceding notation, we can define a DL model diagnosis query set:

Definition 1. A query set $S = \{ f(A_{i,d}) \}$ is a set of queries, where $i \subseteq M$, $d \subseteq D$, and $f$ is a measure.

Instead of evaluating our techniques on specific queries from Table II, we utilize the three query sets shown in Table III to characterize DL model diagnosis workload. These query sets include all queries of a specific family. We leverage these query sets to measure effectiveness of sampling techniques to ensure these techniques do well on the entire family of queries represented by the query set, not just on individual queries.

Query Q2 and all queries of this family are represented by query set S1, which computes the top-K maximally activated neurons. Queries Q1, Q3, and others of this family are represented by query set S2, which computes the average activation for neurons. Queries Q4, Q5, Q6 and Q7, and others of their family are jointly represented by query sets S2 and S3, because finding similarity depends on the average neuron values and the maximally activated neuron distribution.

TABLE II: Representative queries for deep learning diagnosis workload.

Q1. What is the average value for all neurons for layer conv2 across all classes? [17], [23], [33]
Q2. What are the top-k maximally activated neurons for layer Conv2 for all incorrectly classified objects for [25], [31]
Q3. What is the average neuron activation pattern for the last hidden layer in for incorrectly classified correctly classified ? [23], [25], [46]
Q4. Compute the similarity between the logits of and the representation learned by the last convolution layer by ], [28], [33]
Q5. For images of classified as , what are all of the maximally activated neurons in the last convolutional layer? [23], [31]
Q6. Does learn a representation for faster than it learns the representation for ? [27], [33]
Q7. How similar are the representations learnt by two different model architectures, , on the same data set? [28], [33], [37]

TABLE III: Query sets for deep learning model diagnosis workload.

Query sets can consist of any combination of neurons and data items. Instead of considering this immense set of combinations, we limit our evaluation to all combinations of layer, class, and classification (correct or incorrect). Thus, to measure the accuracy of a query set for a sample, we first compute the query results for each of these combinations (layer, class, and classification). Next, we compute a metric comparing the results from the sample with the results for the same combination on the entire data set. Our comparison utilizes metrics specific to each query set, e.g., precision for S1, cosine distance for S2, and Jensen-Shannon distance for S3 (we describe these metrics and the rationale for picking them in more detail in Section V). Finally, we calculate the overall query set accuracy for each query set by averaging the value of the corresponding metric over the combinations.
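The evaluation procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the dictionary layout keyed by (layer, class, correctly classified), the function names, and the example values are all hypothetical, and precision is used as the metric for S1 (the fraction of the sample's top-k that belongs to the true top-k).

```python
def precision_at_k(sample_top_k, full_top_k):
    """Fraction of the sample's top-k neurons that are in the full-data top-k."""
    return len(set(sample_top_k) & set(full_top_k)) / len(sample_top_k)

def query_set_accuracy(sample_results, full_results, metric):
    """Average a per-combination metric over all (layer, class, classification) keys."""
    scores = [metric(sample_results[key], full_results[key]) for key in full_results]
    return sum(scores) / len(scores)

# Hypothetical top-3 neuron indices per (layer, class, correctly_classified) combination.
full = {("conv2", 0, True): [4, 7, 1],
        ("conv2", 0, False): [2, 7, 9]}
sample = {("conv2", 0, True): [4, 1, 7],
          ("conv2", 0, False): [2, 3, 9]}

accuracy = query_set_accuracy(sample, full, precision_at_k)
```

Passing the metric as a function lets the same averaging step serve all three query sets; only the per-combination comparison changes.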

IV. APPROACH

To enable interactive model diagnosis, our approach creates a sample. We compute the results of a query set on this sample instead of on the entire data set. In this section, we describe our approach and present the baseline techniques for selecting these samples.

The key insight we utilize to avoid generating and storing activation values is that DL models learn both a lower dimensional representation of the data and a classifier. DL training transforms the input data, creating a new representation with each layer. Training criteria encourage training set neighbors, such as data points from the same class, to have similar representations. Leveraging this lower dimensional representation learned by the model has the dual benefit of reducing the dimensionality of the data and focusing on the representation learned by the model. Since the objective of the workload is to diagnose this model, we hypothesize that leveraging the learned latent space to select a sample will be key to understanding what the model has learned. For model diagnosis, we view the training and test data points in the latent space, i.e., instead of viewing the data in the high dimensional original format of images, audio, or text, we utilize the lower dimensional representation of the data learned by the model’s last hidden layer to create samples.

Our goal is to diagnose the model, which implies that a subset of the queries will focus on what the model got wrong, as seen in Table II. In a classification problem with multiple classes, the decision boundary partitions the underlying vector space into multiple regions, one for each class. Decision boundaries are where the output label of a classifier is ambiguous, i.e., where errors and mis-classifications occur. The diagnosis of a DL model requires exploration of the decision boundary for a model [18], [58].

For instance, Figure 1 depicts this lower dimensional representation for the two data sets we use for evaluation, MNIST [36] and Galaxy Zoo2 [19]. We use a dimensionality reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE) [53], to reduce the dimensions of this data and visualize it in two dimensions. In Figure 1, each point represents an image from the test set, and colors indicate the true class labels. We make two observations from this visualization: (1) the data representations in latent space show separation for each class, and (2) most mis-classified instances are on the edges of data point groupings.

For DL model diagnosis, the top-k maximally activated neurons and distribution of maximally activated neurons form a large subset of queries in addition to the average values of neurons. The top-k queries are an important area of database research. The best known general-purpose algorithm for identifying top-k items is the Threshold Algorithm [15], which operates on sorted multi-dimensional data required to compute the top-k elements. Approximate algorithms for top-k retrieval require building probabilistic models to fit the score distribution of the underlying data as proposed in [52]. However, we wish to avoid computing and storing activations for the entire data set.

Thus, our approach is based on utilizing the lower dimensional representation when selecting data items for our sample, and focusing on decision boundaries in the latent space when selecting the data points to include in our sample.

A. Baselines

Fig. 1: t-SNE representation of test data for Galaxy Zoo2 (left) and MNIST (right) from the last hidden layer. Each data point represents an input image from the test set. Data point colors represent the true labels.

The naive way of selecting a sample that covers the n-dimensional latent space is to create a grid in that space and sample from each partition. We sample here from the latent space; our goal is to ensure our sample contains instances of data that lie in different regions of that latent space. However, the latent space we choose is high dimensional, e.g., for the MNIST data set, the latent space is 84D. Even if we divide each dimension into only two buckets, we get a total of $2^{84}$ buckets. Instead, we reduce the dimensionality of data in the latent space for this analysis using PCA. We call this naive technique simple latent space sampling. For this sampling technique, we collect an equal number of instances at random from each cell of the underlying grid.
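A minimal sketch of simple latent space sampling follows, assuming the latent vectors have already been reduced to a few dimensions (e.g., by PCA). The function name `grid_sample`, its parameters, and the example points are hypothetical; the sketch also takes fewer points from cells that hold fewer than the requested number.

```python
import random
from collections import defaultdict

def grid_sample(points, buckets_per_dim, per_cell, seed=0):
    """Sample up to `per_cell` point indices uniformly from each occupied grid cell."""
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cell = []
        for d in range(dims):
            # Fall back to width 1.0 when all points share the same coordinate.
            width = (highs[d] - lows[d]) / buckets_per_dim or 1.0
            # Clamp so the maximum value falls in the last bucket.
            cell.append(min(int((p[d] - lows[d]) / width), buckets_per_dim - 1))
        cells[tuple(cell)].append(idx)
    sample = []
    for members in cells.values():
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sorted(sample)

# Why the full 84-D grid is infeasible: two buckets per dimension give 2**84 cells.
n_cells_84d = 2 ** 84

# Hypothetical 2-D points standing in for PCA-reduced latent vectors.
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 0.1), (1.0, 1.0), (0.8, 0.9), (0.2, 0.8)]
chosen = grid_sample(points, buckets_per_dim=2, per_cell=1)
```

With two buckets per dimension, the six example points fall into four cells, so one point is drawn from each cell.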

Another way to lower the dimensions is to utilize the classification result. Each data point is classified by the model as belonging to a class. This result is encapsulated in a confusion matrix (a.k.a. the error matrix), which tabulates the performance of a classification algorithm. For a binary classifier, the confusion matrix counts the number of true positives, false positives, true negatives, and false negatives. For multiple labels, the confusion matrix generalizes this concept. Each row of the matrix represents a predicted class, while each column represents a true class. In this technique, we sample based on cells in the confusion matrix. We call this technique stratified by confusion matrix (CM).

In database systems such as BlinkDB [3], strata are defined over a subset of columns that typically correspond to categorical valued attributes, e.g. city. For DL model diagnosis, the underlying data can be considered a relation, with each row representing a data item (e.g., one image) and each column a value of interest, such as the activation value for a neuron in the model. Each row can be extended with metadata, such as the predicted class and the class label. The stratified by CM sample thus serves as a stratified sample baseline.
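The stratified by CM baseline can be sketched as follows, treating each (predicted, true) cell of the confusion matrix as one stratum and drawing from each cell in proportion to its size. The function name, the parameters, and the rule of keeping at least one item per non-empty cell are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_by_cm(true_labels, pred_labels, fraction, seed=0):
    """Sample item indices from each confusion-matrix cell in proportion to cell size."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for idx, (t, p) in enumerate(zip(true_labels, pred_labels)):
        cells[(p, t)].append(idx)  # (predicted, true) identifies one cell
    sample = []
    for members in cells.values():
        # Assumption: keep at least one item so small cells are not lost.
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sorted(sample)

# Hypothetical binary labels: 16 items, 4 of them misclassified.
true_labels = [0] * 8 + [1] * 8
pred_labels = [0] * 6 + [1] * 2 + [1] * 6 + [0] * 2
chosen = stratified_by_cm(true_labels, pred_labels, fraction=0.5)
```

In this example the two misclassification cells each hold two items, so each contributes one item to the sample alongside three items from each correctly classified cell.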

In addition, we use two other techniques from the literature as baselines. First, we use visualization aware sampling (VAS), designed for large scale data visualization such as scatter plots and map plots. VAS is based on the interchange algorithm [39], which selects tuples that minimize a visualization-inspired loss function. Visualization-inspired loss is based on three common visualization goals: regression, density estimation, and clustering. The interchange algorithm creates a sample that maximizes visual fidelity of the data at arbitrary zoom levels.

Second, we use explicable boundary (EB) trees [57] to create a single sample from input data. This method constructs a boundary tree to approximate the complicated deep neural network models with high fidelity. EB trees provide a single sample for a dataset and a model which explains the boundary between each class learned by the DL model.

B. Clustering in Latent Space

An important part of our approach to selecting a sample for DL model diagnosis is to ensure that model decision boundaries are represented in the sample. To determine boundaries in latent space, we cluster data in latent space and fit a model to estimate the parameters for each class in that space. We do this in both a supervised and an unsupervised manner. When fitting a supervised model, we use the class labels. In the unsupervised case, we use parameterized models, so we utilize the number of unique classes present.

In both supervised and unsupervised cases, the models fitted to the latent space provide us with the likelihood that an object belongs to a class or cluster. For binary classification to determine whether an object belongs to class A or class B, let $P(A \mid x)$ be the likelihood that a data instance $x$ belongs to class A. In this case, the points on the decision boundary of class A and class B are those for which the ratio $P(A \mid x) / P(B \mid x)$ is 1. A lower value of the likelihood ratio would imply that $P(A \mid x) < P(B \mid x)$, in which case $x$ would be assigned to cluster or class B. The higher the likelihood that an object belongs to class A, the higher the ratio will be.

For a multi-class classifier, where a data point may belong to classes , this ratio would be,

Our sampling technique clusters the data in the latent space, then sorts the data in each cluster or class by the likelihood ratio of belonging to that particular class. This sorted list thus consists of exemplars on the higher end and outliers on the lower end of the list. We utilize a tuning parameter j to determine the proportion of exemplars and outliers in our sample. We select j% from the outliers and (100 − j)% from the exemplars. Algorithm 1 describes this approach in further detail.
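The selection step can be sketched as below. This is a hedged illustration, not a transcription of Algorithm 1: it assumes per-item class (or cluster) probabilities are already available from the fitted model, and it assumes the ratio is the likelihood of the assigned class divided by the summed likelihood of all other classes, which reduces to the binary ratio when there are two classes. The function name, the per-cluster quota, and the example probabilities are hypothetical.

```python
from collections import defaultdict

def boundary_sample(probs, sample_fraction, j):
    """Pick j% of each cluster's quota from its outliers and the rest from its exemplars."""
    clusters = defaultdict(list)
    for idx, p in enumerate(probs):
        c = max(range(len(p)), key=lambda k: p[k])    # assigned class or cluster
        rest = sum(p) - p[c]
        # Assumed ratio: likelihood of the assigned class over the sum of the others.
        ratio = p[c] / rest if rest > 0 else float("inf")
        clusters[c].append((ratio, idx))
    sample = []
    for members in clusters.values():
        members.sort()                                 # outliers first, exemplars last
        quota = max(1, round(sample_fraction * len(members)))
        n_out = round(quota * j / 100)                 # taken from the low-ratio end
        n_ex = quota - n_out                           # taken from the high-ratio end
        picked = members[:n_out] + (members[len(members) - n_ex:] if n_ex else [])
        sample.extend(idx for _, idx in picked)
    return sorted(sample)

# Hypothetical probabilities for 10 items over 2 classes.
probs = [(0.99, 0.01), (0.95, 0.05), (0.90, 0.10), (0.60, 0.40), (0.55, 0.45),
         (0.02, 0.98), (0.10, 0.90), (0.20, 0.80), (0.48, 0.52), (0.30, 0.70)]
chosen = boundary_sample(probs, sample_fraction=0.4, j=50)
```

With j = 50, each class contributes its item closest to the decision boundary (items 4 and 8) and its most confident item (items 0 and 5).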

For the unsupervised technique, we utilize a parameterized clustering technique, the Gaussian Mixture Model (GMM). These models offer a probabilistic way to represent normally distributed sub-populations within an overall population. We set the number of clusters in GMM to be equal to the number of unique classes in the dataset. We utilize variational estimation for the GMM [6], where the effective number of components can be inferred from the data.

For the supervised technique, we use max-margin classifiers to classify the data in the latent space. Margin classifiers are a class of supervised classification algorithms that utilize distance from the decision boundary to bound the classifier’s generalization error. The support vector machine (SVM) [50] is an example of this category of classifiers; it learns boundaries based on labels so that the examples of the separate classes are divided by a clear gap that is as wide as possible. SVMs utilize kernel functions [26], which help project data to a higher dimensional space where points can be linearly separated. DL models do not have non-linear activation functions after the last hidden layer, so the latent representation from the last hidden layer should enable discovery of linear boundaries. Thus, we utilize a linear kernel for SVM [16], which has the dual advantage of being faster than non-linear kernels and less prone to over-fitting. Results of the classifier are turned into a probability distribution over classes by using Platt scaling [40], [59]. These probabilities are used to sort the data items in each cluster or predicted class and then select a sample.
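To make the Platt scaling step concrete, the sketch below applies the standard sigmoid form, P(y = 1 | f) = 1 / (1 + exp(a·f + b)), to an SVM decision value f. The parameter values are hypothetical placeholders; in practice a and b are fit by maximum likelihood on held-out decision values, which this sketch does not do.

```python
import math

def platt_probability(decision_value, a, b):
    """Map an SVM decision value f to P(y=1 | f) = 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical parameters (not fitted). a is negative so that a larger
# signed distance from the boundary yields a higher probability.
a, b = -2.0, 0.0

on_boundary = platt_probability(0.0, a, b)    # point on the separating hyperplane
far_inside = platt_probability(3.0, a, b)     # point deep inside the positive class
far_outside = platt_probability(-3.0, a, b)   # point deep inside the negative class
```

Points on the hyperplane map to a probability of one half under these parameters, so sorting by the resulting probabilities places boundary points at the low end of each class's list.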

V. EVALUATION

Here, we empirically evaluate the hypotheses behind our sampling approach, namely that sampling evenly from the latent space is not sufficient, that model decision boundaries are the most important region of this latent space for answering model diagnosis queries, and that they must be well represented in a reliable sample.

We evaluated our sampling techniques from Section IV on two different data sets. We first describe metrics we used to evaluate query sets defined in Section III and data sets and DL models we used for experimental evaluation. We then describe the experiments we conducted and analyze their results.

A. Metrics

Query set S1 retrieves the set of top-k maximally activated neurons. To measure how well our sample performs, we use precision as the metric. Precision is the fraction of top-k results from the sample that belong to the true top-k result. Precision lies in [0, 1]. A precision value of 0 implies that the sample top-k does not contain any of the full data top-k neurons.

Query set S2 retrieves the average value of neurons. This is a high dimensional vector of floating point values, where the dimension is the number of neurons in a layer. Additionally, this is a sparse vector, i.e., many neurons may have zero average activation because of non-linear activation functions like ReLU. Because of the high dimensionality and sparseness of the average neuron vectors, we used cosine distance [10] to measure the distance between the average vector for the entire data set and the average vector from the sample. For these vectors, cosine distance lies in [0, 1]. Writing $u$ and $v$ for the two average vectors, cosine distance is defined as:

$$1 - \frac{u \cdot v}{\|u\| \, \|v\|}$$
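As a small worked example of this metric, the sketch below computes cosine distance in plain Python. The function name and the example vectors are hypothetical; raising an error for an all-zero vector is a choice made for this sketch, since the distance is undefined in that case.

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (||u|| ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical sparse average-activation vectors for one layer.
full_avg = [0.0, 2.0, 0.0, 1.0]
sample_avg = [0.0, 4.0, 0.0, 2.0]    # same direction, different scale
same_direction = cosine_distance(full_avg, sample_avg)
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

The metric ignores overall scale: a sample whose average vector points in the same direction as the full-data vector scores (numerically close to) zero, while vectors with no active neurons in common score one.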

Query set S3 retrieves the distribution of maximally activated neurons. As this is a true distribution, an obvious metric would be Kullback-Leibler (KL) divergence [29]. However, we encountered two issues with this metric. First, KL divergence is unbounded, which means it is not a true metric and makes it difficult to assess how close two distributions are. Second, KL divergence is defined only on distributions with non-zero entries. This does not hold for the maximally activated neuron distribution, which may have neurons with zero counts. Thus, we used Jensen-Shannon divergence [22] instead, which is based on KL divergence. Jensen-Shannon divergence is both symmetric and finite valued. Jensen-Shannon distance is the square root of Jensen-Shannon divergence, which is defined for two distributions $P$ and $Q$ as:

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \frac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q),$$

and lies in [0, 1].
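The following sketch computes the Jensen-Shannon distance for two discrete distributions in plain Python. It assumes base-2 logarithms, which is the choice that bounds the divergence (and hence the distance) by 1; the function names and example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative round-off

# Hypothetical distributions of maximally activated neurons (zero counts allowed).
identical = js_distance([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
partial = js_distance([0.7, 0.3, 0.0], [0.4, 0.4, 0.2])
```

Because the mixture M is non-zero wherever P or Q is non-zero, the KL terms stay finite even when a neuron has a zero count in one of the distributions, which is the property that motivates the switch from KL divergence.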

B. Datasets and Models

We evaluated our sampling techniques on two data sets, Galaxy Zoo2 [19] and MNIST [36]. For each data set, we built and evaluated one deep learning model. The MNIST data set consists of 28x28 pixel gray-scale images of handwritten numerical digits with a training and test set of 60K and 10K images, respectively. We trained the six-layer neural network depicted in Figure 2. This model is based on LeNet-5 [30] for classifying the MNIST data set, with batch normalization added after every layer.

Fig. 2: Deep learning model for MNIST data set.

Fig. 3: Deep learning model for Galaxy Zoo2 data set.

Galaxy Zoo2 [19] is a public catalog of galaxies from the Sloan Digital Sky Survey [42] with classifications from citizen scientists. The Galaxy Zoo decision tree [11] lists the questions answered by citizen scientists. We took a subset of this data set to classify images that appear edge-on vs. face-on (question T01 in [11]). The training and test data sets consist of 54,333 and 2,118 images, respectively, each a 69x69 color image. We trained the model depicted in Figure 3, which is a variation of the model from [13]. In our variation of this model, we reduced the number of dropout layers and added batch normalization after every convolutional layer. We achieved 99% accuracy on the test set and an overall weighted F1 score of 0.99.

We trained the models and extracted activations from them using the TensorFlow [51] library. For both models, we used the representation from the last hidden layer to drive our sampling technique; in both cases, the last hidden layer was a fully connected (FC) layer. The MNIST data representation is from layer FC-2 (Figure 2) with 84 neurons. The Galaxy Zoo2 data representation is from layer FC-1 (Figure 3) with 64 neurons.

C. Experiments

For our first experiment, we evaluated the three query sets on the two data sets using the metrics described above. For each data set, we created samples of size 5%, 10%, 20%, 40%, and 80% for each of the eight sampling techniques under evaluation. Our rationale for choosing these sampling techniques is described in Section IV. Here, we provide a brief description of each technique:

(1) Random sampling draws a sample from the data set uniformly at random without replacement. (2) Stratified by CM sampling draws data items from each cell of the confusion matrix (CM) in proportion to the number of data items in the cell. (3 and 4) Visually aware sampling (VAS) and Explicable Boundary (EB) tree sampling utilize the techniques specified in [39] and [58], respectively. (5) Simple latent space sampling divides the latent space into a grid and then samples equally from each cell in the grid. (6 and 7) GMM sampling fits a GMM to the data points in latent space. For each of the resulting clusters, the data points belonging to that cluster are sorted by the likelihood ratio of belonging to that cluster. The sample is then created by selecting data points from the two ends of this list for each cluster, with a tuning factor j determining what fraction is selected from either end. We have two GMM samples because we evaluated the impact of two types of covariance matrix, spherical and full. Finally, (8) MaxMargin classification sampling classifies the data points in the latent space with a max-margin classifier, sorts the points in each class by the ratio of their likelihood of belonging to that class, and chooses from the two ends of this list with a tuning factor j, like the GMM samples.
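The two-ended selection shared by the GMM and MaxMargin samples can be sketched as follows. This is a minimal illustration under stated assumptions: the likelihood ratios are taken as precomputed inputs (by a fitted GMM or a max-margin classifier), j is interpreted as the fraction of the per-cluster sample taken from the outlier (lowest likelihood ratio) end, as in the tuning-factor experiment, and the name `two_ended_sample` is hypothetical.

```python
def two_ended_sample(scored_points, sample_size, j):
    """Pick items from both ends of one cluster's sorted list.

    scored_points: list of (point_id, likelihood_ratio) pairs for a cluster.
    j: fraction of the sample drawn from the outlier end (0 <= j <= 1).
    """
    if not 0.0 <= j <= 1.0:
        raise ValueError("tuning factor j must lie in [0, 1]")
    # Sort ascending, so outliers (low likelihood ratio) come first.
    ordered = sorted(scored_points, key=lambda item: item[1])
    sample_size = min(sample_size, len(ordered))
    n_outliers = round(j * sample_size)
    n_exemplars = sample_size - n_outliers
    outliers = ordered[:n_outliers]
    exemplars = ordered[len(ordered) - n_exemplars:]
    return [point_id for point_id, _ in outliers + exemplars]


# Ten points in one cluster with assumed likelihood ratios 0.0 .. 9.0.
cluster = [(i, float(i)) for i in range(10)]
mostly_outliers = two_ended_sample(cluster, sample_size=4, j=0.75)
```

Running the function once per cluster (or per predicted class for MaxMargin) and concatenating the results yields the full sample. How the overall sample size is divided among clusters is left open here.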

For the GMM and MaxMargin samples in the first experiment, we fixed the tuning factor j to 0.7. We studied the impact of this tuning factor in the third experiment. The EB tree technique creates a single sample, since there is a single boundary tree for a model and its corresponding data. Figure 4 shows the results of this experiment for both data sets. As we increased the sample size, the query set results became increasingly accurate until, at fraction 1.0 (the full data set), the metrics for all sampling techniques coincided at 1.0 for S1 and 0 for S2 and S3.

Fig. 4: Metrics for S1, S2 and S3 for increasing sample size for various sampling strategies. Top row MNIST, bottom row Galaxy Zoo2. From left to right columns, S1, S2, and S3. The X-axis shows the sample size as a fraction of the entire data set.

For simple latent space sampling, we reduced the dimensionality of the latent space from 84 (MNIST) and 64 (Galaxy Zoo2) to 5 and then divided each dimension into 2 bins, resulting in $2^5$ or 32 bins. We then sampled equally from each bin. This is the only technique where we sampled equally rather than in proportion to the number of items in the bin. We did this in order to evaluate the impact of sampling from the latent space. Interestingly, this technique did not do well across all three query sets. To minimize the impact of randomness, we drew each sample ten times, evaluated it, and averaged the results from these ten iterations. As we can see from Figure 4, the simple latent space sample performed about as well as the random sample. While this sample provided adequate results on S2, giving on average less than 10% error, its performance on S1 and S3 was not adequate. The knee seen for this sample (at 80% of the data set) occurred because at this point the sample had the fewest data items compared to the other samples: data were unevenly distributed in the latent space, and we sampled equally from each bin rather than in proportion to the size of the bin.
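The grid-based selection described above can be sketched as follows. The sketch assumes the points have already been reduced to a low-dimensional latent space (the reduction step is not shown), splits each dimension into equal-width bins between its minimum and maximum, and divides the requested sample size equally among the non-empty cells. These binning and quota rules, and the name `grid_sample`, are assumptions made for illustration.

```python
import random
from collections import defaultdict


def grid_sample(points, sample_size, bins_per_dim=2, seed=0):
    """Sample an equal number of point indices from each non-empty grid cell."""
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        cell = []
        for d in range(dims):
            width = (highs[d] - lows[d]) / bins_per_dim
            idx = int((p[d] - lows[d]) / width) if width > 0 else 0
            cell.append(min(idx, bins_per_dim - 1))  # keep the max in the last bin
        return tuple(cell)

    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[cell_of(p)].append(i)

    quota = max(1, sample_size // len(cells))
    chosen = []
    for members in cells.values():
        # A sparse cell cannot fill its quota, so the sample may fall short.
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return sorted(chosen)


# Toy 2-D latent space: one dense cell and three sparse ones.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2),
          (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.9, 0.9)]
small = grid_sample(points, sample_size=4)
large = grid_sample(points, sample_size=8)
```

With a request of 8 items, the sketch returns only 6, because two cells hold a single point each. This is the same effect that makes the equal-allocation sample smaller than the other samples when the latent space is unevenly populated.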

The stratified by CM sample performed much better than both the random and simple latent space samples for S1 and S3. This is because accuracy over a query set is determined by averaging query results from each layer, for each class, and for correctly and incorrectly classified data items. If a sample did not contain any instances of incorrectly classified data items from a given class, for instance, then the query result was set to 0 for S1 and 1 for S2 and S3. As Table IV shows, random and simple latent space samples often do not contain any of the incorrectly classified data items. VAS did as well as stratified by CM. This is of note because the VAS sample had no knowledge of the classification of each data point and was trying to minimize a visualization-based loss function, which tries to ensure that the sample replicates the data density of the original distribution.
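Stratified by CM sampling can be sketched as below. Each (true label, predicted label) pair is treated as one cell of the confusion matrix, and items are drawn from each cell in proportion to its size. Keeping at least one item from every non-empty cell is an assumption of this sketch, not something the description above specifies, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_by_cm(true_labels, predicted_labels, fraction, seed=0):
    """Draw `fraction` of the items from every confusion-matrix cell."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    rng = random.Random(seed)
    cells = defaultdict(list)
    for i, (t, p) in enumerate(zip(true_labels, predicted_labels)):
        cells[(t, p)].append(i)
    chosen = []
    for members in cells.values():
        # Assumption: keep at least one item per non-empty cell so that
        # rare misclassification cells are never dropped entirely.
        n = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, n))
    return sorted(chosen)


# Toy binary problem: 80 items, 6 of them misclassified.
true = [0] * 40 + [1] * 40
pred = [0] * 38 + [1] * 2 + [1] * 36 + [0] * 4
sample = stratified_by_cm(true, pred, fraction=0.5)
misclassified = [i for i in sample if true[i] != pred[i]]
```

Because every cell contributes, the sample always contains misclassified items when the full data set does, which a uniform random sample of the same size does not guarantee.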

All three clustering-based samples, GMM (full), GMM (spherical), and MaxMargin classification, did better than the baseline samples on all three query sets in most cases. GMM (full) did better than GMM (spherical) for both data sets. GMM (full) fit the data better, as expected, and thus did better at selecting exemplars and outliers than GMM (spherical). The goodness of fit depends on the data set: GMM (spherical) did better than stratified by CM for the MNIST data set but worse for the Galaxy Zoo2 data set. From the two-dimensional representation of the data in latent space for the two data sets in Figure 1, we can see a separation between the ten clusters in MNIST, while the two clusters in the Galaxy Zoo2 data set are not clearly separated. Additionally, each MNIST cluster appears to be somewhat symmetrical, whereas one of the Galaxy Zoo2 clusters is highly asymmetrical. GMM (spherical), with an isotropic covariance matrix, has difficulty fitting the Galaxy Zoo2 data set. GMM (full) fits a more complex Gaussian to each cluster, and this, in turn, provides a much better estimation of outliers vs. exemplars. This difference can be seen in both data sets: while the GMM (full) sample did better than the GMM (spherical) sample for both, the difference in performance was larger when the underlying data distribution assumptions of GMM (spherical) were not met.

MaxMargin classifier-based sampling performed the best on all three query sets. This implies that this technique could distinguish between exemplars and outliers better than GMM, and that a sample based on the MaxMargin classifier is better suited to addressing queries for DL model diagnosis. This is further supported by examination of the data items selected by each sample. Table IV shows the number of correctly and incorrectly classified data items selected by each sampling technique on a 5% sample. MaxMargin classification-based sampling selected the highest number of misclassified data instances. These results for S1 and S3 support our hypothesis that emphasis on the decision boundary improves samples for all query sets.

Finally, the EB tree technique provided a single sample, since there was only one boundary tree for the model. As expected, it did well at picking the outliers and therefore performed well on both S1 and S3. For both data sets, the EB tree sample was the smallest and performed second best on these two query sets. However, as the EB tree sample focused inordinately on the outliers, it did not perform as well on S2. On the well-separated latent space of the MNIST data set, the EB tree performed on par with other sampling techniques. However, for the Galaxy Zoo2 data set, it did not perform as well. MaxMargin classification-based sampling performed better than the EB tree sample for all three queries on both data sets.

In the second experiment, we examined the impact of varying the number of top-k neurons in S1 and measured the precision achieved by each of the eight sampling techniques. Results for this experiment are shown in Figure 5.

For the MNIST data set, a 5% sample had a precision of 0.98 for the top-100 neurons. However, for the Galaxy Zoo2 data set this number was much lower, at 0.70 for the top-100 neurons. This is due to two factors: (1) the test data set over which we evaluated this query had 10K items for MNIST but only about 2K for Galaxy Zoo2, so a 5% sample was 500 data items for MNIST and 105 items for Galaxy Zoo2; and (2) the model for MNIST had 107,786 parameters, while the Galaxy Zoo2 model had an order of magnitude more, at 1,095,842.

Thus, a 5% sample for the Galaxy Zoo2 data set was both smaller and trying to capture a more complex model. This is confirmed by an additional experiment: when we increased the Galaxy Zoo2 sample size to 500 elements, we obtained 85% coverage of the top-100 neurons.

For both data sets MaxMargin classification sampling had the highest precision. EB tree was next for both data sets. This is not surprising because EB tree focuses on decision boundaries. This reinforces our hypothesis that decision boundaries need to be well represented for a sample to perform well on model diagnosis queries.

TABLE IV: Number of data points in samples for each sampling strategy with correctly classified (incorrectly classified) data points in the two data sets.

In the third and final experiment, we evaluated the impact of the tuning factor j for the three clustering-based samples: GMM (full), GMM (spherical), and MaxMargin. The tuning factor j is a number between 0 and 1 that determines how many data points in the sample come from the lowest values of the likelihood ratio, i.e., the outliers. We evaluated the impact of this tuning factor on a 5% sample for all three sampling strategies, evaluating query sets S1, S2, and S3 at tuning factor values of 0, 0.25, 0.50, 0.75, and 1.00. In all three sampling strategies, we picked data items in order from the sorted list for each cluster. The sample is created by selecting items from both ends of the sorted list: a fraction j of the items from the head, or outlier, end of the list and the remaining fraction 1 − j from the exemplar end. Thus, for a tuning factor value of 0, all data instances in the sample were picked from the exemplar end of the list, and for a tuning factor value of 1, all were picked from the outlier end. In this experiment, we additionally created a weighted sample, where the weight was simply the reciprocal of the likelihood ratio. The likelihood ratio can be unbounded for exemplars; therefore, for numerical stability, we selected a threshold. To reduce the impact of random selection, we selected a weighted sample ten times and reported the average value. Figure 6 shows the results of this experiment. For S1, when the sample contained only exemplars, at tuning factor 0, precision was the lowest. Precision grew as the value of the tuning factor increased and then plateaued. The maximum top-10 precision was 0.8 for the MNIST data set and 0.57 for Galaxy Zoo2. This was the maximum top-10 precision that could be achieved on a 5% sample with the three sampling techniques for either data set. For S2, the average activation value was not impacted as much by the tuning factor; the difference was small enough not to significantly impact the value of this metric. For S3, we saw results similar to S1. The highest distance values were at j = 0, because at this point there was the least diversity in the data points: each cluster contributed only exemplar data points. As the number of outliers increased, the distance between the distributions decreased; as the tuning factor increased further and the sample contained an increasing number of outliers, this value decreased at a slower rate.
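The weighted sample mentioned above can be sketched as follows. The weight of each point is the reciprocal of its likelihood ratio, with the ratio clipped to a threshold for numerical stability. The threshold value `max_ratio`, the function name, and the use of the Efraimidis-Spirakis keyed method for weighted sampling without replacement are all assumptions of this sketch. The text above does not state which weighted sampling procedure was used.

```python
import math
import random


def weighted_sample(scored_points, sample_size, max_ratio=1e6, seed=0):
    """Weighted sampling without replacement, weight = 1 / likelihood ratio.

    scored_points: list of (point_id, likelihood_ratio) pairs.
    max_ratio: hypothetical clipping threshold for the likelihood ratio.
    """
    rng = random.Random(seed)
    keyed = []
    for point_id, ratio in scored_points:
        # Clip the ratio so that the weight stays finite and positive.
        clipped = min(max(ratio, 1.0 / max_ratio), max_ratio)
        weight = 1.0 / clipped
        # Efraimidis-Spirakis key in log space: log(u) / weight, u in (0, 1].
        u = 1.0 - rng.random()
        keyed.append((math.log(u) / weight, point_id))
    keyed.sort(reverse=True)  # the largest keys are selected
    return [point_id for _, point_id in keyed[:sample_size]]


# One strong outlier (low ratio) among nine exemplars (high ratio).
scored = [(0, 1e-3)] + [(i, 1e3) for i in range(1, 10)]
picked = weighted_sample(scored, sample_size=3)
```

Unlike the two-ended selection, this procedure is randomized, which is consistent with the weighted samples being less deterministic in their performance.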

Weighted sample values are indicated by dashed lines in Figure 6. These samples did not achieve the best value consistently and were significantly less deterministic in their performance.

D. Sample Creation Overhead

Here, we examine the time it took to create these samples for the baselines as well as for our sampling techniques. Figure 7 depicts the time required to generate a 5% sample for all sampling techniques on the Galaxy Zoo2 test data set; note the log scale on the y-axis. Results for the MNIST data set were similar and are not shown. The Galaxy Zoo2 test set had 2,118 points, each of which is a vector of size [1, 64].

Fig. 5: Metrics for S1 for a varying number of top-k neurons on the 5% sample. Left panel MNIST, right panel Galaxy Zoo2.

Fig. 6: Impact of tuning factor j (x-axis) on metrics for MaxMargin and GMM based sampling strategies. Top row MNIST, bottom row Galaxy Zoo2. From left to right columns, S1, S2, and S3.

Generating the uniform sample was the fastest, as expected. Generating the stratified by CM sample and the GMM samples required similar time, each taking less than a second. Generating the MaxMargin classification-based sample required 1.5 seconds. Both VAS and EB tree samples took three orders of magnitude more time. VAS is created with the interchange algorithm [39]: each point in the data set has to be added and one point evicted, by comparing the proximity of the added point with each element of the existing sample. The cost therefore grows with both K, the sample size, and N, the number of points in the data set. For large data sets, as is typical in ML, the time to create this sample was unacceptably long. The boundary stitching algorithm [58] is O(NK). This was faster than VAS but still took longer than our sampling technique.

VI. RELATED WORK

Our work is related to three categories of research: approximate query processing, model diagnosis systems, and model lifecycle management and tuning systems. We review work from each of these categories below.

Fig. 7: Time to create a 5% sample for Galaxy Zoo2 dataset for all sampling techniques.

1) Approximate query processing (AQP) and top-K queries: ( [1], [3], [7], [20]) AQP is a well-studied area in databases and is an effective technique for dealing with large-scale data. Algorithms for exact top-k queries are defined by the seminal work on the threshold algorithm (TA) [15], which requires access to the indexed attribute(s) of a data set. Efficient processing of top-k queries over samples is a challenging task [21]. Related work in this category includes top-k processing techniques that operate on deterministic data but report approximate answers in favor of performance. The approximate answers are usually associated with probabilistic guarantees indicating how far they are from the exact answer. The algorithms presented in [52] are an approximate adaptation of TA in which the approximate answers to the top-k query are associated with probabilistic guarantees. However, like TA, these algorithms require access to sorted attributes of the underlying data. Another approach to approximate top-k answers is considered in similarity search for multimedia databases [4]. This method uses a proximity measure to determine whether a data region should be inspected. It utilizes the underlying data distribution rather than individual column values and, in that sense, is closer to our approach (i.e., instead of examining the underlying data, we utilize the latent space to create a sample).

2) Model diagnosis systems: ( [5], [23], [24], [34], [55]) ModelTracker [5] is one of the earliest systems for model diagnosis. It diagnoses a model by tracking its performance using statistical measures, such as accuracy and AUC, and does not support model diagnosis for DL models. MLCube [24], one of the earlier visualization tools for model diagnosis, visualizes data from pre-computed data cubes based on features from data and model results. The data cubes utilized by this tool are based on fewer than 100 features and, like ModelTracker, it pre-dates the large scale of data that must be supported for DL model diagnosis. MISTIQUE [55] supports DL model diagnosis via examination of model activations; its primary approach is to reduce the storage footprint required by activations. MISTIQUE shares our goal of reducing query runtime for model diagnosis, but it uses a different approach, quantization and de-duplication, to reduce storage. ModelHub [34] supports model diagnosis by storing learned models and training logs with an approach that reduces storage footprint. ModelHub focuses on different artifacts, learned models and training logs, which it stores and retrieves efficiently by introducing a model versioning system and a domain-specific language for searching through model space, solving a very different problem. DeepBase [43] supports model interpretability and diagnosis by providing a declarative abstraction to express and execute the generation and comparison of these artifacts. DeepBase relies on the ability to encapsulate model interpretability questions as hypothesis functions (e.g., parts of speech tags and image captions). DeepBase, ModelHub, and MISTIQUE could benefit from leveraging our sampling techniques in their systems. Finally, a variety of visualization tools [23], [31], [48], [51], [60] utilize activations and gradients to interpret and diagnose DL models. All of these tools would benefit from our sampling techniques, as sampling would help reduce the scale of data required to support model diagnosis. ActiVis [23], for instance, selectively pre-computes values for nodes of interest to save computation and storage. Sampling techniques such as ours will enable ML practitioners using tools such as ActiVis to avoid making such compromises.

3) Model lifecycle management and model tuning: ( [14], [35], [47], [54]) ModelDB [54] is a system for managing ML models and pipelines. It provides versioning and metadata-based search and validation on models, simplifying the model building pipeline. MLflow [35] tracks experiments, packages the code to create reusable deployments, and operationalizes the chosen models, addressing a very different aspect of model lifecycle management compared to ModelDB. However, neither of these systems helps manage, store, or query any DL model diagnosis artifacts. While MLflow supports storing and tracking arbitrary artifacts in a framework- and implementation-agnostic manner, it does not utilize information such as the representation learned by the models to help with the selection of an appropriate model. In addition, custom code has to be provided for generating and querying these artifacts in MLflow. These tools do not support model diagnosis or interpretability as a primary goal; if they were to adopt model diagnosis as a goal, our sampling technique could help with managing the size of the data required.

VII. CONCLUSION AND FUTURE WORK

Deep learning models have become an indispensable tool for a wide range of tasks, such as image classification, object recognition, speech analysis, machine translation, and more. The task of diagnosis for these purportedly black-box models requires additional artifacts, such as activations. These additional artifacts must be generated, stored, and queried for each DL model being debugged. The addition of these artifacts, which can be up to three orders of magnitude larger than the input data size for each model being diagnosed, turns the process of building, diagnosing, and selecting DL models into a large-scale data management challenge. In this work, we quantify the DL diagnosis workload and present a novel sample creation technique that reduces the time and complexity required to accomplish these tasks.

The sampling technique we present in this paper focuses on sampling input data points, e.g., rows from the relation of data points and activations. The ML literature supports the notion of reducing the number of neurons for which activations need to be calculated and queried [31], [33]. We would like to explore this avenue in future work. The sampling technique described in this paper works well with supervised learning models, i.e., DL models built with labeled data. In future work, we would like to explore our sampling technique and its efficacy for unsupervised DL models, such as generative models, autoregressive models, etc. [12]. A large body of scientific data is unlabeled and requires unsupervised learning techniques, and extending our sampling technique in this direction could be beneficial to the scientific community working on newer data sets.

Acknowledgements: This project is supported by NSF grants OAC-1739419, CCF-1535565, AST-1715122 and Charles and Lisa Simonyi Fund for Arts and Sciences, Washington Research Foundation.

REFERENCES

[1] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD ’99.

[2] https://en.wikipedia.org/wiki/Activation_function.

[3] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In EuroSys ’13, 2013.

[4] G. Amato, F. Rabitti, P. Savino, and P. Zezula. Region proximity in metric spaces and its use for approximate similarity search. ACM Transactions on Information Systems (TOIS), 21(2):192–227, 2003.

[5] S. Amershi, D. M. Chickering, S. M. Drucker, B. Lee, P. Y. Simard, and J. Suh. Modeltracker: Redesigning performance analysis tools for machine learning. In CHI, 2015.

[6] H. Attias. A variational Bayesian framework for graphical models. In Advances in neural information processing systems, pages 209–215, 2000.

[7] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD ’03, 2003.

[8] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.

[9] https://www.cs.toronto.edu/~kriz/cifar.html.

[10] https://en.wikipedia.org/wiki/Cosine_similarity.

[11] https://data.galaxyzoo.org/gz_trees/gz_trees.html.

[12] https://deepmind.com/blog/unsupervised-learning/.

[13] H. Domínguez Sánchez, M. Huertas-Company, M. Bernardi, D. Tuccillo, and J. Fischer. Improving galaxy morphologies for SDSS with deep learning. Monthly Notices of the Royal Astronomical Society, 2018.

[14] C. B. et al. Towards interactive curation and automatic tuning of ML pipelines. In DEEM, 2018.

[15] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. Journal of computer and system sciences, 66(4):614–656, 2003.

[16] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[18] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Comput. Surv., 51(5), Aug. 2018.

[19] https://www.zooniverse.org/projects/zookeeper/galaxy-zoo.

[20] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD ’97, Proceedings, 1997.

[21] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys (CSUR), 40(4):11, 2008.

[22] https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence.

[23] M. Kahng et al. Activis: Visual exploration of industry-scale deep neural network models. IEEE TVCG, 2018.

[24] M. Kahng, D. Fang, and D. H. P. Chau. Visual exploration of machine learning results using data cube analysis. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, page 1. ACM, 2016.

[25] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

[26] https://en.wikipedia.org/wiki/Kernel_method.

[27] S. Kornblith, M. Norouzi, H. Lee, and G. E. Hinton. Similarity of neural network representations revisited. In ICML ’19, 2019.

[28] S. Kornblith, M. Norouzi, H. Lee, and G. E. Hinton. Similarity of neural network representations revisited. In ICML ’19, 2019.

[29] S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.

[30] Y. LeCun. Gradient-based learning applied to document recognition. 1998.

[31] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu. Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics, 2017.

[32] A. Mahendran et al. Visualizing deep convolutional neural networks using natural pre-images. IJCV ’16, 2016.

[33] M. Raghu et al. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017.

[34] H. Miao, A. Li, L. S. Davis, and A. Deshpande. ModelHub: Deep learning lifecycle management. 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017.

[36] http://yann.lecun.com/exdb/mnist/.

[37] A. S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS ’18, 2018.

[38] C. Olah et al. Feature visualization. Distill, 2017.

[39] Y. Park, M. J. Cafarella, and B. Mozafari. Visualization-aware sampling for very large databases. In 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, 2016.

[40] J. Platt. Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. Advances in large margin classifiers, 1999.

[41] https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning.

[42] http://skyserver.sdss.org/dr7.

[43] T. Sellam, K. Lin, I. Y. Huang, M. Yang, C. Vondrick, and E. Wu. Deepbase: Deep inspection of neural networks. In SIGMOD ’19, Amsterdam, 2019.

[44] R. R. Selvaraju et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. ICCV’17, 2018.

[45] K. Simonyan et al. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR’13, 2013.

[46] D. Smilkov, S. Carter, D. Sculley, F. B. Viégas, and M. Wattenberg. Direct-manipulation visualization of deep networks. arXiv preprint arXiv:1708.03788, 2017.

[47] E. Sparks et al. Automating model search for large scale machine learning. In SoCC’15, 2015.

[48] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Trans. Vis. Comput. Graph., 2018.

[49] M. Sundararajan et al. Axiomatic attribution for deep networks. In ICML, 2017.

[50] https://en.wikipedia.org/wiki/Support-vector_machine.

[51] http://www.tensorflow.org/.

[52] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with probabilistic guarantees. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 648–659. VLDB Endowment, 2004.

[53] L. van der Maaten and G. E. Hinton. Visualizing data using t-sne. In Journal of Machine Learning Research 9, Nov, 2008.

[54] M. Vartak. Modeldb: A system for machine learning model management. In CIDR’17, 2017.

[55] M. Vartak, J. M. F. da Trindade, S. Madden, and M. Zaharia. Mistique: A system to store and query model intermediates for model diagnosis. In SIGMOD Conference, 2018.

[56] http://www.robots.ox.ac.uk/~vgg/research/very_deep.

[57] H. Wu, C. Wang, J. Yin, K. Lu, and L. Zhu. Interpreting shared deep learning models via explicable boundary trees. ArXiv, abs/1709.03730, 2017.

[58] H. Wu, C. Wang, J. Yin, K. Lu, and L. Zhu. Sharing deep neural network models with interpretation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, 2018.

[59] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug):975–1005, 2004.

[60] J. Yosinski, J. Clune, A. M. Nguyen, T. J. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. CoRR, 2015.

[61] M. Zeiler et al. Visualizing and understanding convolutional networks. In ECCV ’14, 2014.