Auditing and Debugging Deep Learning Models via Decision Boundaries: Individual-level and Group-level Analysis

2020·arXiv

ABSTRACT

Deep learning models have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. Nevertheless, they are consistently utilized in many applications, consequential to humans’ lives, mostly because of their better performance. Therefore, there is a great need for computational methods that can explain, audit, and debug such models. Here, we use flip points to accomplish these goals for deep learning models with continuous output scores (e.g., computed by softmax), used in social applications. A flip point is any point that lies on the boundary between two output classes: e.g. for a model with a binary yes/no output, a flip point is any input that generates equal scores for “yes" and “no". The flip point closest to a given input is of particular importance because it reveals the least changes in the input that would change a model’s classification, and we show that it is the solution to a well-posed optimization problem. Flip points also enable us to systematically study the decision boundaries of a deep learning classifier. The resulting insight into the decision boundaries of a deep model can clearly explain the model’s output on the individual-level, via an explanation report that is understandable by non-experts. We also develop a procedure to understand and audit model behavior towards groups of people. Flip points can also be used to alter the decision boundaries in order to improve undesirable behaviors. We demonstrate our methods by investigating several models trained on standard datasets used in social applications of machine learning. We also identify the features that are most responsible for particular classifications and misclassifications.

CCS CONCEPTS

• Computing methodologies → Machine learning; • Human-centered computing → Social recommendation.

KEYWORDS

Explainable machine learning, neural networks, deep learning, interpretable AI

1 INTRODUCTION

Our focus in this paper is auditing and debugging deep learning models in social applications of machine learning. In these applications, deep learning models are usually trained for a specific task and then used, for example to make decisions or to make predictions. Despite their unprecedented success in performing machine learning tasks accurately and fast, these trained models are often described as black-boxes because they are so complex that their output is not easily explainable in terms of their inputs. As a result, in many cases, no explanation of decisions based on these models can be provided to those affected by them [38].

Figure 1: Example of the kind of information that can be obtained by calculating flip points. We answer questions such as, “For a particular input to a deep learning model, what is the smallest change in a single continuous feature that changes the output of the model? What is the smallest change in a particular set of features that changes the output?"

This inexplainability becomes problematic when deep learning models are utilized in tasks consequential to human lives, such as in criminal justice, medicine, and business. Independent studies have revealed that many of these black-box models have unacceptable behavior, for example towards features such as race, age, etc. of individuals [24]. Because of this, there have been calls for avoiding deep learning models in high-stakes decision making [21]. Additionally, laws and regulations have been proposed to require decisions made by the machine learning models to be accompanied with clear explanations for the individuals affected by the decisions [31]. Several methods have been developed to explain the outputs of models simpler than deep learning models to non-expert users such as administrators or clinicians [13, 18, 22, 38]. In contrast, existing interpretation methods for deep learning models either lack the ability to directly communicate with non-expert users or have limitations in their scope, computational ability, or accuracy, as we will explain in the next section.

In the meantime, deep learning is ever more widely used in important applications in order to achieve high accuracy, scalability, etc. Sometimes, deep learning models are utilized even when they do not have a clear advantage over simple models, perhaps to avoid transparency or to preserve the models as proprietary [23]. While it is not easy to draw the line as to where their use is advantageous, it is important to have the computational tools to thoroughly audit the models, provide the required explanations for their outputs, and/or expose their flaws and biases. It would also be useful to have the tools to change their undesirable behavior.

We provide tools for two levels of auditing: individual-level and group-level. The type of feedback that our methods provide on the individual-level is illustrated in Figure 1 and discussed in Section 4.1; in particular, we identify sets of features that have no effect on the model’s decision and sets that change the decision, and we find the closest input with a different decision. For group-level analysis, we develop methods to audit the behavior of models towards groups of individuals, for example, people of a certain race or with a certain level of education.

In Section 2, we review the literature and explain the advantages of our method compared to other popular methods such as LIME [19]. In Section 4, we present our computational approach to perform the above tasks, based on investigating and altering the decision boundaries of deep learning models by computing flip points, certain interesting points on those boundaries, defined in Section 3, where we also introduce the concept of constrained flip points. In Section 5, we present our numerical results on three different datasets with societal context. Section 6 compares our methods with other applicable methods in the literature. Finally, in Section 7, we present our conclusions and directions for future work.

2 LITERATURE REVIEW

There have been several approaches proposed for interpreting deep learning models and other black-box models. Here we mention a few papers representative of the field.

Spangher et al. [27] have (independently) defined a flip set as the set of changes in the input that can flip the prediction of a classifier. However, their method is only applicable to linear classifiers such as linear regression models and logistic regression. They use flip sets to explain the least changes in individual inputs but do not go further to interpret the overall behavior of the model or to debug it.

Wachter et al. [32] define counterfactuals as the possible changes in the input that can produce a different output label and use them to explain the decision of a model. However, their closest counterfactual is mathematically ill-defined; for deep learning models with continuous output, there is no "closest point" with a different output label because there are points arbitrarily close to the decision boundary. Moreover, their proposed algorithm uses enumeration, applicable only to a small number of features. Russell [25] later suggested integer programming to solve such optimization problems, but the models used as examples are linear with small dimensionality, and the closest counterfactual in their formulation is ill-defined.

Some studies have taken a model-agnostic approach to interpreting black-box models. For example, the approach taken by Ribeiro et al. [19], known as LIME, randomly perturbs an input until it obtains points on two sides of a decision boundary and then performs linear regression to estimate the location of the boundary in that vicinity. The simplifying assumption to approximate the decision boundary with hyperplanes can be misleading for deep learning models, as shown by [5, 36]. Hence, the output of the LIME model and its corresponding explanation may actually contradict the output of the original model, as empirically shown by [33]. Another issue in LIME’s approach is the reliance on random perturbations of inputs, which has computational limitations. Lakkaraju et al. [16] have also shown via surveys that such explanations may not be effective in communicating with non-expert users. Our method has an accuracy advantage over LIME, because we find a point exactly on the decision boundary instead of estimating its location via a surrogate linear regression model. Additionally, our explanation report can directly communicate with non-expert users such as credit applicants or clinicians.

There are approaches that create rule-lists based on the classifications of a deep learning model, and then use the obtained rules to explain the outputs [15, 16, 20]. These approaches have serious limitations in terms of scalability and accuracy, mostly because a deep learning model is usually too complex to be emulated via a simple set of if-then rules. For example, the outputs of the if-then rules obtained by [16] differ from the outputs of their neural network for more than 10% of the data points, even though the feature space has only 7 dimensions. The computation time to obtain the rule-list is also on the order of a few hours for the 7-feature model, while we provide the explanation report for an input with 88 features in a few seconds.

Koh and Liang [12] and Koh et al. [11] have used influence functions to reveal the importance of individual training data in forming the trained model, but their method cannot be used to explain outputs of the models or to investigate the decision boundaries.

There are studies in deep learning that consider the decision boundaries from other perspectives. For example, Elsayed et al. [3] and Jiang et al. [9] use first-order Taylor series approximation to estimate the distance to decision boundaries for individual inputs, and study the distance in relation to generalization error in deep learning. However, those approximation methods have been shown to be unreliable for nonlinear models [36]. Methods to generate adversarial inputs, for example Fawzi et al. [4], Jetley et al. [8], Moosavi-Dezfooli et al. [17], apply small perturbations to an input until its classification changes, but those methods do not seek the closest point on the decision boundaries and therefore cannot find the least changes required to change the model’s output. Most recent methods for computing adversarial inputs, such as Ilyas et al. [7] and Tsipras et al. [29], also do not seek points on or near the decision boundaries.

3 DEFINING AND COMPUTING FLIP POINTS

For ease of exposition, in this section we consider a model with two continuous outputs. Extensions to models with multi-class outputs or quantified output are straightforward. We first review the work on flip points in [35] and then define constrained flip points.

3.1 Flip points

Consider a model N that has two continuous outputs $z_1$ and $z_2$. For convenience, we assume that they are normalized to sum to 1 (e.g., by softmax) and write $z = N(x)$. An output with $z_1(x) > 1/2$ corresponds to one class, for example, “cancerous". Similarly, $z_1(x) < 1/2$ might be a prediction of “noncancerous", and the prediction for $z_1(x) = 1/2$ is undefined.

We refer to points on the decision boundary as flip points, and we are particularly interested in the smallest change in a given input x that changes the decision of the model. We can find this closest flip point $\hat{x}^c$ by solving an optimization problem

$$\min_{\hat{x}} \; \|\hat{x} - x\| \quad \text{subject to} \quad z_1(\hat{x}) = 1/2,$$

where $\|\cdot\|$ is a norm appropriate to the data. Specific problems might require additional constraints, for example, upper and lower bounds on image data, or integer constraints on features such as gender. It is possible that the solution $\hat{x}^c$ is not unique, but the minimal distance is always unique.
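As a rough illustration of this optimization problem, the following sketch computes a closest flip point in the Euclidean norm. It assumes a made-up two-feature logistic model in place of a trained network, takes the flip condition to be that the first output score equals 1/2, and uses SciPy's SLSQP routine as a stand-in for the solvers used in the paper. The names `z1` and `closest_flip_point` are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for a trained model: z1(x) is the score of the first
# class and the second score is 1 - z1(x). The weights are made up.
W = np.array([1.0, 2.0])
B = -1.0


def z1(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + B)))


def closest_flip_point(x):
    """Minimize the distance to x subject to z1(x_hat) = 1/2 (local solver)."""
    x = np.asarray(x, dtype=float)
    result = minimize(
        fun=lambda x_hat: np.sum((x_hat - x) ** 2),  # squared distance is smooth
        x0=x,                                        # start from the input itself
        method="SLSQP",
        constraints=[{"type": "eq", "fun": lambda x_hat: z1(x_hat) - 0.5}],
        options={"ftol": 1e-12, "maxiter": 200},
    )
    return result.x


x = np.array([1.0, 0.5])
x_flip = closest_flip_point(x)
distance = np.linalg.norm(x_flip - x)
```

SLSQP is a local method, so for a genuinely nonconvex network one would typically try several starting points or solvers and keep the nearest feasible result. Bounds or integrality constraints on features would be added to the same problem.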

Our optimization problem can be solved by off-the-shelf or specialized algorithms that determine local minimizers for nonconvex problems. For a neural network, the cost of each iteration in determining a flip point is less than the marginal cost of including one point in one iteration of training the model. For the examples we provide in this paper, computing a closest flip point takes less than a second on a 2017 MacBook.

Another way of looking at the cost is to observe that the cost of computing a flip point is proportional (with a constant factor in complexity) to the cost of evaluating the output of the model for that input. So, assuming that we want to audit a particular model that is already in use on a computer, that computer would be able to compute the flip point and the explanation report as well. If the auditor wants the closest flip points for an entire dataset, they can be computed in parallel.

See [34] for more details on defining and computing flip points for 2-class, multi-class and quantified output.

3.2 Constrained flip points

Suppose, for a particular input, we are interested in the influence of a single feature on the output of our model. If the feature has discrete values (e.g., “owns home", “rents", “no fixed address"), then, as is well known, we simply evaluate the model for the same input but different values for that feature. If the feature has continuous values, though, we might be interested in the smallest change in that feature that changes the decision of the model. Then to compute this closest constrained flip point we solve the optimization problem of Section 3.1 allowing only that feature to vary. This is a 1-variable optimization problem that can be solved by standard algorithms such as bisection and other methods used for linesearch.
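The one-variable search can be sketched with plain bisection. The toy model `z1`, its weights, and the function name `flip_along_feature` are assumptions made for illustration. The routine needs a bracket `[lo, hi]` on which the score crosses 1/2, and it returns the closest crossing only if there is a single crossing inside that bracket.

```python
import math


def z1(x):
    # Hypothetical toy model with made-up weights: score of the first class.
    logit = 0.8 * x[0] - 0.5 * x[1] + 0.3
    return 1.0 / (1.0 + math.exp(-logit))


def flip_along_feature(model, x, j, lo, hi, tol=1e-10):
    """Bisection on feature j over [lo, hi]; returns None without a sign change."""
    def g(v):
        y = list(x)
        y[j] = v                     # vary only feature j, keep the rest fixed
        return model(y) - 0.5

    g_lo, g_hi = g(lo), g(hi)
    if g_lo * g_hi > 0:
        return None                  # no crossing detected in this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid                 # crossing lies in the left half
        else:
            lo, g_lo = mid, g_mid    # crossing lies in the right half
    return 0.5 * (lo + hi)


x = [1.0, 2.0]
v = flip_along_feature(z1, x, j=0, lo=-5.0, hi=1.0)
no_flip = flip_along_feature(z1, x, j=0, lo=1.0, hi=5.0)
```

Searching the intervals on both sides of the current feature value and keeping the nearer root gives the closest constrained flip point for that feature.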

If we want to allow k (continuous or discrete) features to vary, then we find the closest constrained flip point by solving the optimization problem of Section 3.1 but with only these k variables, keeping the other features constant. We solve this problem using the same approaches discussed for computing unconstrained flip points.

Finally, if we allow all features to vary, we solve our original optimization problem, obtaining an unconstrained flip point.

3.3 Two notes on defining flip points

Sometimes datasets have redundant features, e.g., features that are linearly dependent or features that are not related to outputs. Redundant features may not contribute to the predictive power of the model, and including them in training may even lead to overfitting [28]. In our numerical examples, we show that excluding nearly linearly dependent features may improve the generalization of models. So, it can be helpful to study the dependencies prior to training.

Moreover, knowing the dependencies among the features can help in choosing meaningful subsets of features for computing constrained flip points. For example, “income" and “net worth" may be correlated in a dataset. If we choose to vary a subset of features that contains “income" while holding “net worth" constant, the constrained flip point might not be very meaningful.

So for many reasons, it can be desirable to identify the dependencies among the features in a dataset. In our computational examples, we do this using the pivoted QR decomposition [6, Chap. 5] of a data matrix D whose rows are the training data points and whose columns are features. This decomposition reorders the columns, pushing linearly dependent columns (redundant features) to the right and forming

$$DP = QR,$$

where P is the permutation matrix, Q has orthogonal columns, and R is zero below its main diagonal. The degree of independence of the features can be determined by measuring the matrix condition number of leading principal submatrices of R, or by taking the matrix norm of trailing sets of columns. The numerical rank of D is the dimension of the largest leading principal submatrix of R with a sufficiently small condition number or, equivalently, the smallest number of leading columns that yields a small norm for the trailing columns.
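A small sketch of this use of pivoted QR, assuming synthetic data with one deliberately redundant column and SciPy's `scipy.linalg.qr` with `pivoting=True`. The threshold `tol` and the use of the diagonal of R as a proxy for the conditioning of its leading submatrices are simplifying assumptions, not the paper's exact criterion.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 4))       # rows: data points, columns: features
D[:, 3] = D[:, 0] + D[:, 1]        # make feature 3 linearly dependent

# Column-pivoted QR: D[:, piv] equals Q @ R, and |diag(R)| is non-increasing.
Q, R, piv = qr(D, mode="economic", pivoting=True)

diag = np.abs(np.diag(R))
tol = 1e-8                         # assumed relative threshold
numerical_rank = int(np.sum(diag > tol * diag[0]))

# Columns pushed to the right of the permutation are candidates for removal.
redundant = piv[numerical_rank:]
```

Because features 0, 1, and 3 are mutually dependent here, any one of them may end up flagged; which one is dropped is a modeling choice.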

Alternatively, the singular value decomposition (SVD) of D can be used in a similar way [6, Chap. 2]. In this case, the numerical rank is the number of sufficiently large singular values. The SVD will identify principal components (i.e., linear combinations of features in decreasing order of importance), and unimportant ones can be omitted. The most significant combinations of features can be used as training inputs, instead of the original features.
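The SVD variant can be sketched in the same way. The data are synthetic, the relative threshold `tol` is an assumption, and the data are centered before extracting principal components, which is a common convention rather than something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(200, 5))
D[:, 4] = 2.0 * D[:, 0] - D[:, 2]   # one redundant feature

# Numerical rank: number of singular values that are not negligibly small.
s = np.linalg.svd(D, compute_uv=False)
tol = 1e-8                           # assumed relative threshold
numerical_rank = int(np.sum(s > tol * s[0]))

# Principal components of the centered data; keep the leading ones as inputs.
Dc = D - D.mean(axis=0)
_, s_c, Vt = np.linalg.svd(Dc, full_matrices=False)
reduced = Dc @ Vt[:numerical_rank].T
```

The ratio `s[0] / s[-1]` is the 2-norm condition number of D, which is very large here because of the dependent column.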

The underlying metric of these matrix decompositions is the Euclidean norm, so they are most easily justified for continuous features measured on a single scale, for example, pixel values in an image. For disparate features, the scale factors used by practitioners to define an appropriate norm for the optimization problem in Section 3.1 can be used to renormalize features before forming D. Leaving the choice of scale factor to practitioners is suggested by Spangher et al. [27] and Wachter et al. [32], too.

4 USING FLIP POINTS TO EXPLAIN, AUDIT AND DEBUG MODELS

4.1 Individual-level auditing: Providing explanations and feedback to users of a model

To generate a report like that in Figure 1, we need to compute flip points and constrained flip points in order to determine the smallest changes in the features that change the model’s output. Algorithm 1 summarizes the use of constrained flip points in generating such a report, giving a user precise information on how individual features and combinations of features influenced the model’s recommendation for a given input. This has not previously been possible.
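The single-feature part of such a report might be assembled as in the sketch below. This is not the paper's Algorithm 1. It assumes a toy model with a linear logit, for which the flip value of one feature has a closed form, and the feature names, weights, and bounds are all made up. For a real network, `single_feature_flip` would call an optimization routine instead.

```python
import numpy as np

NAMES = ["income", "debt", "months_on_file"]   # hypothetical feature names
W = np.array([0.8, -0.5, 0.0])                 # made-up linear-logit weights
B = 0.3
LOWER = np.zeros(3)                            # assumed valid feature ranges
UPPER = np.full(3, 10.0)


def logit(x):
    return float(W @ x + B)


def single_feature_flip(x, j):
    """Value of feature j that puts the toy model on its boundary, or None."""
    if W[j] == 0.0:
        return None                            # feature has no effect at all
    v = x[j] - logit(x) / W[j]
    return v if LOWER[j] <= v <= UPPER[j] else None


def explanation_report(x):
    flips, no_flips = [], []
    for j, name in enumerate(NAMES):
        v = single_feature_flip(x, j)
        if v is None:
            no_flips.append(f"{name}: no value within bounds changes the decision")
        else:
            flips.append(f"{name}: change {x[j]:.3f} -> {v:.3f} ({v - x[j]:+.3f})")
    return flips, no_flips


x = np.array([1.0, 2.0, 5.0])
flips, no_flips = explanation_report(x)
```

The same two-list structure extends to pairs or larger groups of features, with each entry produced by a constrained flip point computation.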

4.2 Group-level auditing: Studying the behavior of a model towards groups of individuals

It is important to audit and explain the behavior of models, not only on the individual-level, but also towards groups. Groups of interest can be an entire dataset or specific subsets within it, such as people of a certain age, gender, education, etc. The information obtained from the group-level analysis can reveal systematic traits or biases in the model’s behavior. It can also reveal the role of individual features or combinations of features in the overall behavior of the model.

Algorithm 2 presents some of the ways that flip points can yield insight into these matters. By computing the closest flip points for a group of individuals, we obtain the vectors of directions to the decision boundary for them. We call these directions flip directions. Using pivoted QR decomposition or principal component analysis (PCA) on the vectors of directions, we can identify important patterns and traits in a model’s decision making for the group of individuals under study.
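The idea of flip directions can be illustrated with a toy model whose closest flip points are known in closed form. The sketch assumes a linear logit with made-up weights, so the closest flip point in the Euclidean norm is the orthogonal projection onto the boundary, and then applies standard centered PCA to the directions. For a trained network, `flips` would come from the optimization problem of Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([1.0, -2.0, 0.5])     # made-up weights of a linear-logit toy model
B = 0.2
X = rng.normal(size=(100, 3))      # group of inputs under study

# Closed-form closest flip points for this toy model: project onto W.x + B = 0.
logits = X @ W + B
flips = X - np.outer(logits / (W @ W), W)
directions = flips - X             # flip directions, one row per input

# PCA on the flip directions via the SVD of the centered matrix.
centered = directions - directions.mean(axis=0)
_, s, Vt = np.linalg.svd(centered, full_matrices=False)
first_pc = Vt[0]
explained = s**2 / np.sum(s**2)
```

For this toy model every direction is a multiple of `W`, so the first principal component recovers it. For a deep model, the loadings of the leading components indicate which features change most, and in which direction, across the group.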

For example, consider auditing a cancer prediction model for a group of individuals with cancerous tumors. After computing the flip directions, we can study the patterns of change for that population, e.g., which features have changed most significantly and in which direction.

We can also study the effect of specific features on a model’s decision making for specific groups. For this type of analysis, we compute constrained flip points for the individuals in the group, allowing only the feature(s) of interest to change. We then study patterns in the directions of change. For example, when auditing a model trained to evaluate loan applications, we might examine the effect of age for people who have been denied. We can compute constrained flip points for those individuals, allowing only the feature of age to change, and then study the patterns in flip directions, i.e., in which direction “age" should change and to what extent in order to change the decisions for that population.

We might also want to examine the effect of gender for the same loan application model. To do this, we pair each data point with an identical one but of opposite gender. We compute flip points for all of the inputs and look for patterns: For the paired points whose classification did not change, did the mean/median distance to the decision boundary change significantly? For the points whose classification changed, do the directions to the boundary have any commonalities, as revealed by pivoted QR or principal component analysis (PCA)?
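The pairing procedure can be sketched as follows, again on a toy model. The linear logit, its weights, and the 0/1 coding of gender are assumptions, and distances are measured over the continuous features only, with gender held fixed, so that they have a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
W_CONT = np.array([1.0, -0.5])     # made-up weights for continuous features
W_GENDER = 0.4                     # made-up weight for a 0/1 gender feature
B = 0.0


def logit(x_cont, gender):
    return x_cont @ W_CONT + W_GENDER * gender + B


X = rng.normal(size=(200, 2))
gender = rng.integers(0, 2, size=200)
paired_gender = 1 - gender         # identical inputs with opposite gender

l_orig = logit(X, gender)
l_pair = logit(X, paired_gender)

# Distance to the boundary when only the continuous features may change.
scale = np.linalg.norm(W_CONT)
d_orig = np.abs(l_orig) / scale
d_pair = np.abs(l_pair) / scale

# Pairs whose classification changed, and the distance shift for the rest.
changed = np.sign(l_orig) != np.sign(l_pair)
fraction_changed = changed.mean()
median_shift = np.median(d_pair[~changed] - d_orig[~changed])
```

A large `fraction_changed`, or a consistent shift in distance for the unchanged pairs, would suggest that the model is sensitive to gender.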

4.3 Debugging a model

If we determine that the model’s behavior is undesirable for a particular set of inputs, we would like to alter the decision boundaries to change that behavior. For example, when there is bias towards a certain feature, it usually means data points are close to decision boundaries in that feature dimension. By computing constrained flip points in that dimension, adding them to the training set with the same label, and retraining, we can push the decision boundaries away from the inputs in that dimension. This tends to change the behavior of models, as we show in our numerical results.

Moving the decision boundaries away from the training data also tends to improve the generalization of deep learning models as reported by Elsayed et al. [3] and Yousefzadeh and O’Leary [35].

It is also possible to create flip points and teach them to the model with a flip label (i.e., equal scores of 1/2 for both classes), in order to define a decision boundary in certain locations.

5 RESULTS

Here, we demonstrate our techniques for explaining, auditing, and debugging deep learning models on three different datasets with societal themes. We use three software packages, NLopt [10], IPOPT [30], and the Optimization Toolbox of MATLAB, as well as our own custom-designed homotopy algorithm [35], to solve the optimization problems. The algorithms almost always converge to the same point. The variety and abundance of global and local optimization algorithms in the above optimization packages give us confidence that we have indeed usually found the closest flip point.

For the first two examples, the FICO challenge and the Credit dataset, we compare our results with two recent papers that have used those datasets. To make the comparison fair and easy, for each dataset we make the same choices about the data (such as cross validation, portion of testing set, etc.) as each of those papers.

5.1 FICO Explainable ML Challenge

This dataset has 10,459 observations with 23 features, and each data point is labeled as “Good" or “Bad" risk. We randomly pick 20% of the data as the testing set and keep the rest as the training set. We regard all features as continuous, since even “months" can be measured that way. The description of features is provided in Appendix A.

5.1.1 Eliminating redundant features. The condition number of the matrix formed from the training set is 653. Pivoted QR factorization finds that features “MSinceMostRecentTradeOpen", “NumTrades90Ever2DerogPubRec", and “NumInqLast6Mexcl7days" are the most dependent columns; discarding them leads to a training set with condition number 59. Using the data with 20 features, we train a neural network with 5 layers, achieving 72.90% accuracy on the testing set. A similar network trained with all 23 features achieved 70.79% accuracy, confirming the effectiveness of our decision to discard three features.

5.1.2 Individual-level explanations. As an example, consider the first datapoint, corresponding to a person with “Bad" risk performance. The feature values for this data point are provided in Appendix A. The closest (unconstrained) flip point is virtually identical to the data point except in five features, shown in Table 1.

Table 1: Difference in features for data point # 1 in the FICO dataset and its closest flip point.

Next, we allow only a subset of the features to change and compute constrained flip points. We explore the following subspaces:

(1) Only one feature is allowed to change at a time. None of the 20 features is individually capable of flipping the decision of the model.

(2) Pairs of features are allowed to change at a time. Only a few of the pairs (29 out of 190) can flip the output. 13 of these pairs involve the feature “MSinceMostRecentInqexcl7days" as partially reflected in the explanation report of Figure 2.

(3) Combinations of features that share the same measurement scale are allowed to change at a time. We have five distinct groups: features that are measured in “percentage", “number of months", “number of trades", “delinquency measure", and “net fraction burden". The last two feature groups are not capable of flipping the prediction of the model by themselves.
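The subspaces listed above can be enumerated mechanically. In the sketch below, the assignment of feature indices to the five measurement-scale groups is a placeholder and does not reflect the actual FICO columns. Each subset is turned into a boolean mask that tells a solver which features may vary.

```python
from itertools import combinations

n_features = 20

# (1) one feature at a time, (2) all unordered pairs of features
singles = [(j,) for j in range(n_features)]
pairs = list(combinations(range(n_features), 2))

# (3) groups sharing a measurement scale; the indices are placeholders only
scale_groups = {
    "percentage": (0, 1),
    "number of months": (2, 3, 4),
    "number of trades": (5, 6, 7, 8),
    "delinquency measure": (9, 10),
    "net fraction burden": (11, 12),
}

subspaces = singles + pairs + list(scale_groups.values())


def mask_for(subset, n=n_features):
    """Boolean mask: True for features allowed to vary, False for fixed ones."""
    return [j in subset for j in range(n)]


pair_masks = [mask_for(p) for p in pairs]
```

With 20 features there are 190 pairs, which matches the count quoted in item (2), so an exhaustive pass over singles and pairs stays cheap when each flip point takes milliseconds.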

The explanation summary report resulting from these computations is shown in Figure 2. The top two sections show the results of computing constrained flip points, first, points where no constrained flip point exists and the label does not change, and then points with different label. The bottom section displays the unconstrained flip point. This shows that the output of a deep learning model can be explained clearly and accurately to the user to any desired level of detail. The answer to other specific questions can also be found easily by modifying the optimization problem.

We note that the time it takes to find each flip point is only a few milliseconds using a 2017 MacBook, hence this report can be generated in real-time.

5.1.3 Group-level explanations. Using pivoted QR on the matrix of directions between data points labeled “Bad" and their flip points, we find that, individually, the three most influential features are “AverageMInFile", “NumInqLast6M", and “NumBank2NatlTradesWHighUtilization". Similarly, for the directions that flip a “Good" to a “Bad", the three most influential features are “AverageMInFile", “NumInqLast6M", and “NetFractionRevolvingBurden". In both cases, “ExternalRiskEstimate" has no influence.

Figure 2: A sample explanation report for data point #1 in the FICO dataset, classified by a deep learning model.

We perform PCA analysis on the subset of directions that flip a “Bad" to “Good" risk performance. The first principal component reveals that, for this model, the most prominent features with positive impact are “PercentTradesNeverDelq" and “PercentTradesWBalance", while the features with most negative impact are “MaxDelqEver" and “MSinceMostRecentDelq". These conclusions are similar to the influential features reported by [1]; however, our method gives more detailed insights, since it includes an individual-level explanation report and also analysis of the group effects.

5.1.4 Effects of redundant variables. Interestingly, for the model trained on all 23 features, the three most significant individual features in flipping its decisions are “MSinceMostRecentTradeOpen", “NumTrades90Ever2DerogPubRec" and “NumInqLast6Mexcl7days", exactly the three dependent features that we discarded for the reduced model. Thus, the decision of the trained model is more susceptible to changes in the dependent features, compared to changes in the independent features.

This reveals an important vulnerability of machine learning models regarding their training sets. For this dataset, when dependent features are included in the training set, the accuracy on the training set remains the same, but the accuracy on the testing set, i.e., generalization, is adversely affected. Additionally, when those redundant features are included, they become the most influential features in flipping the decisions of the model, making the model vulnerable.

5.1.5 Auditing the model using flip directions. Figure 3 shows the directions of change to move from the inputs to the closest flip points for features “NumInqLast6M" and “NetFractionRevolvingBurden", which are the most influential features given by the pivoted QR algorithm. Even though flip points are unconstrained, directions of change for these two features are distinctly clustered for flipping a “Bad" label to “Good" and vice versa.

Figure 3: Directions between the inputs and their closest flip point for two influential features. Points are distinctly clustered based on the direction of the flip.

Furthermore, Figure 4 shows the directions in coordinates of the first two principal components. We can see that the flip directions are clearly clustered into two convex cones, exactly in opposite directions. Also, we see that misclassified inputs are relatively close to their flip points while correct predictions can be close or far.

Figure 4: Change between the inputs and their unconstrained flip points in the first two principal components. Directions are clustered into two convex cones, exactly in opposite directions.

5.1.6 Comparison. The interpretable model developed by Chen et al. [1] reports the most influential features which are similar to our findings above, e.g., “PercentTradesNeverDelq" and “AverageMInFile". However, their model is inherently interpretable, and their auditing method is not applicable to deep learning models. They also do not provide an explanation report on the individual-level, like the one we provided in Figure 2.

We note that our goal, here, is to show how a deep learning model utilized for this application can be audited. We do not necessarily advocate for use of deep learning models over other models.

5.2 Default of credit card clients

This dataset from the UCI Machine Learning Repository [2] has 30,000 observations, 24 features, and a binary label predicting whether or not the person will default on the next payment.

We binarize the categorical variables “Gender", “Education", and “Marital status"; the categories that are active for a data point have binary value of 1 in their corresponding features, while the other features are set to zero. When searching for a flip point, we allow exactly one binary feature to be equal to 1 for each of the categorical variables. The condition number of the training set is 129 which implies linear independence of features. Using a 10-fold cross validation on the data, we train a neural network with 5 layers (details in Appendix C), to achieve accuracy of 81.8% on the testing set, slightly higher than the accuracy of around 80.6% reported by [27]. When calculating the closest flip points, we require the categorical variables to remain discrete.
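One simple way to respect the requirement that exactly one binary feature be active per categorical variable is to enumerate the admissible one-hot assignments and solve a continuous problem for each. The sketch below covers only the enumeration and a validity check. The category counts are illustrative assumptions and are not taken from the dataset.

```python
from itertools import product

import numpy as np

# Number of categories per variable: illustrative values, not the real ones.
CATEGORICAL = {"Gender": 2, "Education": 4, "Marital status": 3}


def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v


def admissible_encodings():
    """Yield every concatenated one-hot block with one active entry per variable."""
    sizes = list(CATEGORICAL.values())
    for choice in product(*(range(k) for k in sizes)):
        yield np.concatenate([one_hot(i, k) for i, k in zip(choice, sizes)])


def is_valid(block_vector):
    """Check that each categorical block is binary and sums to exactly 1."""
    start = 0
    for k in CATEGORICAL.values():
        block = block_vector[start:start + k]
        if not (np.isin(block, (0.0, 1.0)).all() and block.sum() == 1.0):
            return False
        start += k
    return True


encodings = list(admissible_encodings())
```

For each admissible encoding, the categorical part of the input is fixed and the closest flip point is computed over the continuous features; the nearest result over all encodings is kept. Mixed-integer solvers are an alternative when the number of combinations is large.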

5.2.1 Individual-level explanations. We consider the data point #1 in this dataset which is classified as “default", and compose the explanation report shown in Figure 5. When we examine the effect of features, we see that any of 4 features can flip the prediction of the model, individually.

When examining this report for input #1, we find some flaws in the model. For example, in order to flip the prediction of the model to non-default, one option is to reduce the amount of the current bill to -$2,310,000, while reducing the bill to any number larger than that would not flip the prediction.

Requiring any negative balance on the bill is irrational, because as long as the bill is zero, there would be no chance of default. In fact, one would expect the prediction of non-default if the current bill is changed to zero, for any datapoint. But, the training set does not include such examples, and clearly, our model has not learned such an axiom. Requiring the large payment of $24,750 (for 2 months ago) in order to flip the prediction seems questionable, too, considering that the current bill is $3,913.

Therefore, despite the model’s good accuracy on the testing data, the explanation for its prediction reveals flaws in its behavior for data point #1. These flaws would not have been noticed without investigating the decision boundaries. Fortunately, because of our auditing, we know that the model needs to be improved before it is deployed.

5.2.2 Group-level auditing using flip points. Examining the flip points for the training data reveals model characteristics that should be understood by the users. Here is one example.

Figure 5: A sample explanation report for data point #1 in the Credit dataset, predicted to default on the next payment. The deep learning model predicts the labels for the testing data well, but what it takes to change the prediction of the model sometimes does not seem rational.

Gender does not have much influence in the decisions of the model, as only about 0.5% of inputs have a different gender than their flip points. Hence, gender is not an influential feature for this model. This kind of analysis can be performed for all the features, in more detail.
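A statistic of this kind can be computed by comparing the active category of each input with that of its flip point. The function name, the column layout, and the four-row example below are hypothetical.

```python
import numpy as np


def fraction_category_changed(X, F, cols):
    """Fraction of rows whose active one-hot category in `cols` differs
    between the inputs X and their flip points F."""
    before = np.argmax(X[:, cols], axis=1)
    after = np.argmax(F[:, cols], axis=1)
    return float(np.mean(before != after))


# Toy example: columns 0-1 are a one-hot gender encoding, column 2 is continuous.
X = np.array([[1, 0, 0.2], [0, 1, 0.5], [1, 0, 0.9], [0, 1, 0.1]], dtype=float)
F = np.array([[1, 0, 0.4], [1, 0, 0.5], [1, 0, 0.7], [0, 1, 0.3]], dtype=float)

gender_cols = [0, 1]
changed_fraction = fraction_category_changed(X, F, gender_cols)
```

Running the same function over each categorical variable gives a quick ranking of how often each one participates in flipping the decision.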

5.2.3 Group-level auditing using flip directions. We perform pivoted QR decomposition on the directions to the closest flip points. The results show that “BILL-AMT3" and “BILL-AMT5" are the most influential features, and “Age" has the least influence in changing the predictions. In fact, there is no significant change between the age of any of the inputs and their closest flip points.
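A ranking of this kind can be sketched with SciPy's column-pivoted QR. The example below is a hedged illustration, not the authors' code: it assumes the direction matrix has one row per data point and one column per feature, and it reads the pivot order as the influence ranking. The toy data and the feature name "PAY-AMT1" are hypothetical.

```python
import numpy as np
from scipy.linalg import qr

def rank_features_by_pivoted_qr(inputs, flip_points, feature_names):
    """Rank features by the column-pivot order of the flip-direction matrix.

    Column pivoting selects, at each step, the column with the largest
    remaining norm, so early pivots mark features that change the most.
    """
    directions = np.asarray(flip_points, dtype=float) - np.asarray(inputs, dtype=float)
    _, r, pivots = qr(directions, mode="economic", pivoting=True)
    ranked = [feature_names[j] for j in pivots]
    strengths = np.abs(np.diag(r))  # non-increasing magnitudes along the diagonal
    return ranked, strengths

# Toy directions: the first feature moves a lot, the second never moves.
toy_x = np.zeros((3, 3))
toy_flip = np.array([[3.0, 0.0, 0.1],
                     [4.0, 0.0, 0.2],
                     [-5.0, 0.0, 0.1]])
ranked, strengths = rank_features_by_pivoted_qr(
    toy_x, toy_flip, ["BILL-AMT3", "Age", "PAY-AMT1"])
```

In this toy case the feature that never changes between inputs and flip points ends up last, with a diagonal entry of essentially zero, mirroring the observation about “Age" above.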

5.2.4 Debugging the model using flip points. In both our training and testing sets, about 52% of individuals have age less than 35. Following [27], we remove 70% of the young individuals from the training set, so that they are under-sampled. We keep the testing set as before and obtain 80.83% accuracy on the original testing set. We observe that now, “Age" is the 3rd most influential feature in flipping its decisions. Moreover, PCA analysis shows that lower Age has a negative impact on the “no default" prediction and vice versa.

We consider all the data points in the training set labelled as “default" that have closest flip point with older age, and all the points labelled “no default" that have closest flip point with younger age. We add all those flip points to the training set, with the same label as their corresponding data point, and train a new model. Now Age has become the 11th most influential feature and it is no longer significant in the first principal component of the flip directions; hence, the bias against Age has been reduced. Also, testing accuracy slightly increases to 80.9%.
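The selection rule in this paragraph can be written compactly. The sketch below assumes the training inputs, their labels, and their closest flip points are already available as arrays; the column index `age_col` and the label encoding (`default_label=1`) are illustrative assumptions.

```python
import numpy as np

def augment_with_flip_points(X, y, flips, age_col, default_label=1):
    """Append selected flip points to the training set.

    Selected are the flip points of "default" points whose flip point is
    older, and of "no default" points whose flip point is younger. Each
    appended flip point keeps the label of its originating data point.
    """
    X = np.asarray(X, dtype=float)
    flips = np.asarray(flips, dtype=float)
    y = np.asarray(y)
    older = flips[:, age_col] > X[:, age_col]
    younger = flips[:, age_col] < X[:, age_col]
    selected = ((y == default_label) & older) | ((y != default_label) & younger)
    X_aug = np.vstack([X, flips[selected]])
    y_aug = np.concatenate([y, y[selected]])
    return X_aug, y_aug

# Toy data: column 0 is age, column 1 is an arbitrary second feature.
toy_X = np.array([[30, 1.0], [40, 2.0], [25, 3.0], [50, 4.0]])
toy_y = np.array([1, 1, 0, 0])
toy_flips = np.array([[35, 1.0], [38, 2.0], [20, 3.0], [55, 4.0]])
X_aug, y_aug = augment_with_flip_points(toy_X, toy_y, toy_flips, age_col=0)
```

Because the added points lie on the current decision boundary but carry the label of their originating point, retraining pushes the boundary away from them along the Age direction, which is the intended effect.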

Adding synthetic data to the training set has great potential to change the behavior of a model, but we cannot rule out unintended consequences. By investigating the influential features and the PCA analysis, we see that the model has been altered only with respect to the Age feature, and the overall behavior of the model has not changed.

5.2.5 Comparison. Spangher et al. [27] used a logistic regression model for this dataset, achieving 80.6% accuracy on testing, less than our 81.8% accuracy. Their method for computing flip points is limited to linear models and is not applicable to deep learning. They also do not provide an explanation report like the one in Figure 5.

They have reported that under-sampling young individuals from the training set makes their model biased towards young age, similar to ours. However, they do not use flip points to reduce the bias, which we successfully did.

5.3 Adult Income dataset

The Adult dataset from the UCI Machine Learning Repository [2] has a combination of discrete and continuous variables. Each of the 32,561 data points in the training set and 16,281 in the testing set is labeled, indicating whether the individual’s income is greater than 50K annually. There are 6 continuous variables, including Age, Years of education, Capital-gain, Capital-loss, and Hours-per-week of work. We binarize the discrete variables: Work-class, Marital status, Occupation, Relationship, Race, Gender, and Native country. Our trained model considers 88 features and achieves 86.08% accuracy on the testing set, comparable to the best results in the literature [2]. Our aim here is to show how a trained model can be audited.

5.3.1 Individual-level auditing. As an example, consider the first data point in the testing set, corresponding to a 25-year-old Black Male with an 11th grade education and native country of United States, working 40 hours per week in the Private sector as a Machine-operator-inspector, with income “≤50K", correctly classified by the model. He has never married and has a child/children.

We compute the closest flip point for this individual, allowing all the features to change. Table 2 shows the features that have changed for this person in order to flip the model’s classification for him to the high income bracket. Other features such as gender, race, work-class have not changed and are not shown in the table. Directions of change in the features are generally sensible: e.g., working more hours, getting a higher education, working in the Tech-sector, and being older generally have a direct relationship with higher income. Being married instead of being a single parent is also known to have a relationship with higher income.

We further observe that none of the features individually can flip the classification, but certain constrained flip points can provide additional insights about the behavior of the model.

Table 2: Difference in features for Adult dataset testing point #1 and its closest flip point.

Let’s consider the effect of race. The softmax score for this individual is 0.9989 for income “≤50K". Changing the race does not change the softmax score by more than 0.0007. This observation about the softmax score might lead one to believe that the model is neutral about race, at least for this individual. However, that would not be completely accurate in all circumstances, as we will explain. If we keep all features of this individual the same and only change his race to Asian, the closest flip point for him would be the same as before, except for Age of 29.9 and Hours-per-week of 42.3. The differences between the flip points for the Black and Asian cases are not large enough to draw a conclusion.

Let’s now take one step further and fix his education to remain 11th grade and re-examine the effect of race. The resulting closest flip points are shown in Table 3 for two cases: where his race is kept Black and where it is changed to Asian. Clearly, being Asian requires considerably smaller changes in other features in order to reach the decision boundary of the model and flip to the high income class. This shows that race can be an influential feature in model’s classifications of people with low education. Having education above the 12th grade for this individual makes the effect of race negligible.
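A constrained closest flip point of this kind can be sketched as an equality-constrained minimization in which some features are held fixed. The example below is an illustration under stated assumptions, not the authors' solver: it uses SciPy's SLSQP method, a hypothetical `score_gap(x)` function that is zero on the decision boundary, and a toy linear model so that the answer can be checked by hand.

```python
import numpy as np
from scipy.optimize import minimize

def closest_flip_point(x, score_gap, free_idx):
    """Minimise ||x_hat - x||^2 subject to score_gap(x_hat) = 0,
    changing only the features listed in free_idx."""
    x = np.asarray(x, dtype=float)
    free_idx = np.asarray(free_idx)

    def assemble(v):
        x_hat = x.copy()
        x_hat[free_idx] = v  # fixed features keep their original values
        return x_hat

    result = minimize(
        fun=lambda v: np.sum((v - x[free_idx]) ** 2),
        jac=lambda v: 2.0 * (v - x[free_idx]),
        x0=x[free_idx].copy(),
        constraints=[{"type": "eq", "fun": lambda v: score_gap(assemble(v))}],
        method="SLSQP",
    )
    return assemble(result.x), result.success

# Hypothetical toy model with a linear gap between the two class scores.
w = np.array([1.0, 2.0, -1.0])
b = -4.0

def toy_gap(x):
    return float(w @ x + b)

x0 = np.zeros(3)
flip_all, _ = closest_flip_point(x0, toy_gap, free_idx=[0, 1, 2])
flip_fixed, _ = closest_flip_point(x0, toy_gap, free_idx=[0, 2])  # feature 1 held fixed
```

Holding a feature fixed can only increase the distance to the boundary, which is why comparing constrained flip points across two settings of another feature (here, race) can expose effects that the unconstrained flip point hides. For a nonlinear network the problem is nonconvex, so a local solver such as this one returns a nearby flip point rather than a guaranteed closest one.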

Table 3: Race can be an influential feature for individuals with low education. Closest flip points for testing point #1 in Adult dataset when education is fixed to 7th grade and race is changed from Black to Asian.

We further observe that gender does not have an effect on the model’s classification for this individual, whether the education is high or low. The effect of other features related to occupation and family can also be studied.

5.3.2 Group-level auditing using flip points. As an example, we consider the group of people with native country of Mexico. About 95% of this population have income “≤50K" and 77% of them are Male. We compute the closest flip points for this population and investigate the patterns in them: how frequently features have changed from data points to flip points, and in what way.

Let’s consider the effect of gender. 99% of the females in this group have income “≤50K", and for 40% of them, their closest flip point is Male. Among the Males, however, less than 1% have a Female flip point; some of these are high-income individuals for whom the change in gender flips them to low-income.

Let’s now consider the patterns in flip points that change low income males and females to high-income. For occupation, the most common change is entering into the Tech-sector and the most common exit is from the Farming-fishing occupation. For relationship, the most common change is to being married and the most common exit is from being Not-in-family and Never-married. Among the continuous features, Years of education and Capital-gain have changed most frequently.

5.3.3 Group-level auditing using flip directions. Consider the subset of directions that flip a “≤50K" income to “> 50K" for the population with native country of Mexico. The first principal component reveals that, for this model and this population, the most prominent features with positive impact are having a higher education, having Capital-gain, and working in the Tech-sector, while the features with the most negative impact are being Never-married, being Female, and having Capital-loss. Looking more deeply at the data, pivoted QR decomposition of the matrix of flip directions reveals that some features, such as being Black and native country of Peru, have no impact on this flip.
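The principal-component reading described here can be sketched with a singular value decomposition. This is a minimal illustration under assumptions: the directions are centered before the decomposition (standard PCA), and because the sign of a principal component is arbitrary, the component is oriented to agree with the mean flip direction so that "positive" and "negative" coefficients are meaningful. Both choices are ours, not stated in the text.

```python
import numpy as np

def first_principal_component(directions):
    """Return the first principal component of the flip directions and the
    fraction of variance it explains. Rows are data points, columns features."""
    D = np.asarray(directions, dtype=float)
    mean_dir = D.mean(axis=0)
    _, s, vt = np.linalg.svd(D - mean_dir, full_matrices=False)
    pc = vt[0]
    if pc @ mean_dir < 0:
        pc = -pc  # orient the component along the average flip direction
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return pc, explained

# Toy directions: feature 0 increases, feature 2 decreases, feature 1 is noise.
toy_dirs = np.array([[2.0, 0.0, -1.0],
                     [4.0, 0.1, -2.0],
                     [6.0, -0.1, -3.0],
                     [8.0, 0.0, -4.0]])
pc, explained = first_principal_component(toy_dirs)
```

Features with large positive coefficients in `pc` are those that tend to increase along the flip, and large negative coefficients mark features that tend to decrease; coefficients near zero indicate little involvement in this component.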

5.3.4 Group-level analysis of flip directions for misclassifications. Besides studying specific groups of individuals, we can also study the misclassifications of the model. PCA on the flip directions for all the misclassified points in the training set shows that Age has the largest coefficient in the first principal component, followed by Hours-per-week of work. The most significant feature with negative coefficient is having Capital-gain. These features can be considered the most influential in confusing and de-confusing the model. PCA on the flip directions explains how our model is influenced by various features and its vulnerabilities for misclassification. It thus enables us to create inputs that are mistakenly classified for adversarial purposes, as explained by Lakkaraju and Bastani [14] and Slack et al. [26].

6 COMPARISON WITH OTHER INTERPRETATION APPROACHES FOR DEEP LEARNING

Our use of flip points for interpretation and debugging builds on existing methods in the literature but provides more comprehensive capabilities. For example, Spangher et al. [27] compute flip sets only for linear classifiers and do not use them to explain the overall behavior of the model, identify influential features, or debug.

LIME [19] and Anchors [20] rely on sampling around an input in order to investigate decision boundaries, which is inefficient and less accurate than our approach, and their authors do not propose using the results as we do. LIME provides a coefficient for each feature (representing a hyperplane), which may not be easily understandable by non-experts (e.g., a loan applicant or a clinician), especially when dealing with a combination of discrete and continuous features. LIME’s approach also relies on simplifying assumptions, such as the ability to approximate decision boundaries by hyperplanes, which leads to contradictions between the LIME output and the model output [33], also known as infidelity. Our method therefore has an accuracy advantage over theirs as well. Moreover, their reliance on random perturbations of data points can be considered a computational limitation when applying their method to deep learning models.

The interpretation we provide for nonlinear deep learning models is comparable in quality and extent to the interpretations provided in the literature for simple models. For example, the model suggested by [1] for the FICO Explainable ML dataset reports the most influential features in decision making of their model, similar to our findings in Section 5.1, and investigates the overall behavior of the model, similar to our results for the Adult dataset. But, their methods are not applicable for auditing deep learning models. Moreover, they do not provide a detailed explanation report.

We also show how decision boundaries can be altered to change the behavior of models, an approach not explored for deep learning models.

7 CONCLUSIONS AND FUTURE WORK

We have proposed the computation of flip points in order to explain, debug, and audit deep learning models with continuous output. We demonstrated that computation of the closest flip point for an input to a continuous model provides useful information to the user, explaining why a model produced a particular output and identifying any small changes in the input that would change the output. Flip points also provide useful information to model auditors, exposing bias and revealing patterns in misclassifications. We provide an algorithm to formalize the auditing procedure. Finally, model developers can use flip points in order to alter the decision boundaries and eliminate undesirable behavior of a model.

Our proposed method has accuracy advantages over existing methods in the literature, and it also has practical advantages such as fast interpretation for individual inputs and the ability to communicate with non-expert users (such as a loan applicant or a clinician) via an explanation report.

For future work, we would consider models with continuous outputs other than classification models, for example, a model that recommends the dose of a drug for patients. Other directions of research include auditing image classification models, expanding on work in [35], and text analysis models that have a societal impact. Our methods can promote fairness, accountability and transparency in deep learning models.

A DESCRIPTION OF VARIABLES FOR THE FICO DATASET

Table A1: Variable descriptions for the FICO dataset.

The name of each variable for the FICO dataset can be viewed in the first column of Table A1. The second column shows the corresponding description for each variable as provided by FICO. Additionally, the third column of this table shows the value of each variable for data point #1. Detailed information about the challenge can be found here: https://community.fico.com/s/explainable-machine-learning-challenge.

B CODE

The code along with a readme file and an example procedure are available at https://github.com/roozbeh-yz/auditing.

C INFORMATION ABOUT THE MODELS

Here, we provide more information about the models we have trained and used in Section 5. We have used fully connected feedforward neural networks with up to 6 hidden layers. The number of nodes for the models used for each data set is shown in Table C1. The activation function we have used in the nodes is the error function, as defined in [35]. We have also used softmax on the output layer, and cross entropy for the loss function.
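For concreteness, the architecture described above can be sketched as a plain forward pass. This is an illustrative sketch only: the layer sizes are hypothetical (Table C1 gives the actual ones), the weights are random rather than trained, and the error function is used unscaled, which may differ from the exact definition in [35].

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise error function activation

def init_params(layer_sizes, seed=0):
    """Random weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """erf activations on the hidden layers, softmax on the output layer."""
    a = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        a = _erf(a @ W + b)
    W, b = params[-1]
    z = a @ W + b
    z = z - z.max(axis=-1, keepdims=True)  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class."""
    labels = np.asarray(labels)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Hypothetical sizes: 88 inputs (as for the Adult model), two hidden layers, 2 classes.
params = init_params([88, 10, 5, 2])
probs = forward(params, np.zeros((3, 88)))
loss = cross_entropy(probs, [0, 1, 0])
```

With zero biases, an all-zero input produces equal scores for both classes, so it is a flip point of this untrained toy network in the sense used throughout the paper.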

Models are designed using the method described by Yousefzadeh and O’Leary [37].

Table C1: Number of nodes in neural network used for each data set.

REFERENCES

[1] Chaofan Chen, Kangcheng Lin, Cynthia Rudin, Yaron Shaposhnik, Sijia Wang, and Tong Wang. 2018. An Interpretable Model with Globally Consistent Explanations for Credit Risk. arXiv preprint arXiv:1811.12615 (2018).

[2] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

[3] Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. 2018. Large margin deep networks for classification. In Advances in Neural Information Processing Systems (NeurIPS 2018). 842–852.

[4] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. 2017. The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine 34, 6 (2017), 50–62.

[5] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, and Stefano Soatto. 2018. Empirical study of the topology and geometry of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3762–3770.

[6] Gene H Golub and Charles F Van Loan. 2012. Matrix Computations (4th ed.). JHU Press, Baltimore.

[7] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. Adversarial Examples Are Not Bugs, They Are Features. arXiv preprint arXiv:1905.02175 (2019).

[8] Saumya Jetley, Nicholas Lord, and Philip Torr. 2018. With friends like these, who needs adversaries?. In Advances in Neural Information Processing Systems (NeurIPS 2018). 10749–10759.

[9] Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. 2019. Predicting the Generalization Gap in Deep Networks with Margin Distributions. In International Conference on Learning Representations (ICLR 2019). arXiv preprint arXiv:1810.00113.

[10] Steven G. Johnson. 2014. The NLopt Nonlinear-optimization Package. http://ab-initio.mit.edu/nlopt

[11] Pang Wei Koh, Kai-Siang Ang, Hubert HK Teo, and Percy Liang. 2019. On the Accuracy of Influence Functions for Measuring Group Effects. arXiv preprint arXiv:1905.13289 (2019).

[12] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In International Conference on Machine Learning (ICML 2017). 1885–1894.

[13] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2019. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006 (2019).

[14] Himabindu Lakkaraju and Osbert Bastani. 2019. "How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations. arXiv preprint arXiv:1911.06473 (2019).

[15] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. 2017. Interpretable & Explorable Approximations of Black Box Models. arXiv preprint arXiv:1707.01154 (2017).

[16] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. 2019. Faithful and customizable explanations of black box models. In Artificial Intelligence, Ethics, and Society. http://www.aies-conference.com/wp-content/papers/main/AIES-19_paper_143.pdf

[17] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2574–2582.

[18] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2018. How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1802.00682 (2018).

[19] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why Should I Trust You?: Explaining the Predictions of Any Classifier. In International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.

[20] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-Precision Model-Agnostic Explanations. In AAAI Conference on Artificial Intelligence. 1527–1535.

[21] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.

[22] Cynthia Rudin and Şeyda Ertekin. 2018. Learning customized and optimized lists of rules with mathematical programming. Mathematical Programming Computation 10, 4 (2018), 659–702.

[23] Cynthia Rudin and Joanna Radin. 2019. Why Are We Using Black Box Models in AI When We Don’t Need To? A Lesson From An Explainable AI Competition. Harvard Data Science Review 1, 2 (2019).

[24] Cynthia Rudin, Caroline Wang, and Beau Coker. 2018. The age of secrecy and unfairness in recidivism prediction. arXiv preprint arXiv:1811.00731 (2018).

[25] Chris Russell. 2019. Efficient Search for Diverse Coherent Explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*

[26] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2019. How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods. arXiv preprint arXiv:1911.02508 (2019).

[27] Alexander Spangher, Berk Ustun, and Yang Liu. 2018. Actionable Recourse in Linear Classification. In Proceedings of the 5th Workshop on Fairness, Accountability and Transparency in Machine Learning.

[28] Laura Toloşi and Thomas Lengauer. 2011. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27, 14 (2011), 1986–1994.

[29] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2019. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations (ICLR 2019).

[30] Andreas Wächter and Lorenz T Biegler. 2006. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106, 1 (2006), 25–57.

[31] Sandra Wachter and Brent Mittelstadt. 2019. A Right to Reasonable Inferences: Re-Thinking Data Protection Law in the Age of Big Data and AI. Columbia Business Law Review 2 (2019), 494–620.

[32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2018. Counterfactual Explanations Without Opening The Black Box: Automated Decisions and The GDPR. Harvard Journal of Law & Technology 31, 2 (2018).

[33] Adam White and Artur d’Avila Garcez. 2019. Measurable Counterfactual Local Explanations for Any Classifier. arXiv preprint arXiv:1908.03020 (2019).

[34] Roozbeh Yousefzadeh. 2019. Interpreting Machine Learning Models and Application of Homotopy Methods. Ph.D. Dissertation. University of Maryland, College Park. https://www.cs.umd.edu/users/oleary/RoozbehYousefzadehThesis.pdf

[35] Roozbeh Yousefzadeh and Dianne P O’Leary. 2019. Interpreting Neural Networks Using Flip Points. arXiv preprint arXiv:1903.08789 (2019).

[36] Roozbeh Yousefzadeh and Dianne P O’Leary. 2019. Investigating Decision Boundaries of Trained Neural Networks. arXiv preprint arXiv:1908.02802 (2019).

[37] Roozbeh Yousefzadeh and Dianne P O’Leary. 2019. Refining the Structure of Neural Networks Using Matrix Conditioning. arXiv preprint arXiv:1908.02400 (2019).

[38] Jiaming Zeng, Berk Ustun, and Cynthia Rudin. 2017. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society) 180, 3 (2017), 689–722.
