Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications

2020·Arxiv

Abstract

With the broader and highly successful usage of machine learning in industry and the sciences, there has been a growing demand for Explainable AI. Interpretability and explanation methods for gaining a better understanding about the problem solving abilities and strategies of nonlinear Machine Learning, in particular, deep neural networks, are therefore receiving increased attention. In this work we aim to (1) provide a timely overview of this active emerging field, with a focus on ‘post-hoc’ explanations, and explain its theoretical foundations, (2) put interpretability algorithms to a test both from a theory and comparative evaluation perspective using extensive simulations, (3) outline best practice aspects i.e. how to best include interpretation methods into the standard usage of machine learning and (4) demonstrate successful usage of explainable AI in a representative selection of application scenarios. Finally, we discuss challenges and possible future directions of this exciting foundational field of machine learning.

Index Terms—Interpretability, deep learning, neural networks, black-box models, explainable artificial intelligence (XAI), model transparency.

I. INTRODUCTION

A main goal of machine learning is to learn accurate decision systems, or predictors, that can help automate tasks that would otherwise have to be done by humans. Machine Learning (ML) has supplied a wealth of algorithms that have demonstrated important successes in the sciences and industry; the most popular ML workhorses are considered to be kernel methods (e.g. [190], [164], [132], [163], [194]) and, particularly during the last decade, deep learning methods (e.g. [23], [52], [108], [107], [161], [70]).

W. Samek is with the Dept. of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany, and with BIFOLD – Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany. (e-mail: wojciech.samek@hhi.fraunhofer.de).

G. Montavon and C. Anders are with the Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany, and with BIFOLD – Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany. (e-mail: gregoire.montavon@tu-berlin.de).

S. Lapuschkin is with the Dept. of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany.

K.-R. Müller is with the Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany, and also with the Dept. of Artificial Intelligence, Korea University, Seoul 136-713, South Korea, the Max Planck Institute for Informatics, 66123 Saarbrücken, Germany, and BIFOLD – Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany. (e-mail: klaus-robert.mueller@tu-berlin.de).

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

As ML is increasingly used in real-world applications, a general consensus has emerged that high prediction accuracy alone may not be sufficient in practice [104], [28], [159]. Instead, in practical engineering of systems, critical features that are typically considered beyond excellent prediction itself are (a) robustness of the system to measurement artefacts or adversarial perturbations [182], (b) resilience to drifting data distributions [49], (c) ability to accurately assess the confidence of its own predictions [139], [135], (d) safety and security aspects [21], [84], [26], [193], (e) legal requirements or adherence to social norms [54], [60], (f) ability to complement human expertise in decision making [82], or (g) ability to reveal to the user the interesting correlations it has found in the data [88], [165].

Orthogonal to the quest for better and more holistic machine learning models, Explainable AI (XAI) [159], [73], [112], [128], [18] has developed as a subfield of machine learning that seeks to augment the training process, the learned representations and the decisions with human-interpretable explanations. An example is medical diagnosis, where the input examples (e.g. histopathological images) come with various artifacts (e.g. stemming from image quality or suboptimal annotations) that have in principle nothing to do with the diagnostic task, yet, due to the limited amount of available data, the ML model may harvest specifically these spurious correlations with the prediction target (e.g. [61], [177]). Here interpretability could point at anomalous or awkward decision strategies before harm is caused in a later usage as a diagnostic tool.

Interpretability is similarly essential when using ML in the sciences, since ideally the transparent ML model, having learned from data, may have embodied scientific knowledge that would subsequently provide insight to the scientist; occasionally this can even be novel scientific insight (see e.g. [165]). Note that in numerous scientific applications it has been most common so far to use linear models [151], favoring interpretability often at the expense of predictivity (see e.g. [63], [117]).

To summarize, there is a strong push toward better understanding the ML systems that are being used, and in consequence black-box algorithms are increasingly being abandoned in many applications. This growing consensus has led to the strong growth of a subfield of ML, namely explainable AI (XAI), which strives to produce transparent nonlinear learning methods and supplies novel theoretical perspectives on machine learning models, along with powerful practical tools for a better understanding and interpretation of AI systems.

In this review paper, we will summarize the recent exciting developments, present different classes of XAI methods that have been proposed in the context of deep neural networks, provide theoretical insights, and highlight the current best practices when applying these methods. Note, finally, that we do not attempt an encyclopedic treatment of all available XAI literature; rather, we present a slightly biased point of view (and in doing so we often draw from the work of the authors). In particular, we focus on post-hoc explanation methods, which take any model, typically the best performing one, and analyze it in a second step in order to uncover its decision strategy. We also provide, to the best of our knowledge, references to other related works for further reading.

II. TOWARDS EXPLAINING DEEP NEURAL NETWORKS

Before discussing aspects of the problem of explanation that are specific to deep neural networks, we first introduce some basics of Explainable AI, which apply to a fairly general class of machine learning models. The ML model will be assumed to be already trained and the input-output relation it implements will be abstracted by some function:

$$f : \mathbb{R}^d \to \mathbb{R}.$$

This function receives as input a vector of real-valued features $x = (x_1, \dots, x_d)$, typically corresponding to various sensor measurements. The function produces as an output a real-valued score on which the decision is based. A sketch of such a function receiving two features $x_1$ and $x_2$ as input is given in Fig. 1.

Fig. 1. Example of a nonlinear function of the input features, which produces some prediction. The function can be approximated locally as a linear model.

In the context of ML classification, the function output can be interpreted as the amount of evidence for / against deciding in favor of a certain class. A classification decision is then obtained from the output score by testing whether the latter is above a certain threshold, or for multiclass problems, larger than the output score of other functions representing the remaining classes.

In a medical scenario, the function may receive as input an array of clinical variables, and the output of the function may represent the evidence for a certain medical condition [96]. In an engineering setting, the input could be the composition of some compound material, and the output could be a prediction of its strength [197] or stability.

Suppose a given instance is predicted by the machine learning model to be healthy, or a compound material is predicted to have high strength. We may choose to trust the prediction and go ahead with the next step within an application scenario. However, we may benefit from taking a closer look at that prediction, e.g. to verify that the prediction ‘healthy’ is associated with relevant clinical information, and not with some spurious features that accidentally correlate with the predicted quantity in the dataset [101], [104]. Such a problem can often be identified by building an explanation of the ML prediction [104].

Conversely, suppose that another instance is predicted by the machine learning model to be of low health or low strength. Here, an explanation could prove equally useful as it could hint at actions to be taken on the sample to improve its predicted score [189], e.g. possible therapies in a medical setting, or small adjustments of the compound design that lead to higher strength.

A. How to Explain: Global vs. Local

Numerous approaches have emerged to shed light onto machine learning predictions. Certain approaches such as activation-maximization [174], [136], [134] aim at a global interpretation of the model, by identifying prototypical cases for the output quantity

$$x^\star = \arg\max_x f(x)$$

and allowing in principle to verify that the function has a high value only for the valid cases. While these prototypical cases may be interesting per se, both for model validation and for knowledge discovery, such prototypes will be of little use for understanding, for a given example x (say, near the decision boundary), which features play in favor of or against the model output f(x).

Specifically, we would like to know for that very example what input features contribute positively or negatively to the given prediction. These local analyses of the decision function have received growing attention and many approaches have been proposed [14], [201], [13], [147], [183]. For simple models with limited nonlinearity, the decision function can be approximated locally as the linear function [13]:

$$f(x) \approx \sum_{i=1}^{d} [\nabla f(\widetilde{x})]_i \cdot (x_i - \widetilde{x}_i), \qquad (1)$$

where $\widetilde{x}$ is some nearby root point, i.e. a point with $f(\widetilde{x}) = 0$ (cf. Fig. 1). This expansion takes the form of a weighted sum over the input features, where the summand $R_i = [\nabla f(\widetilde{x})]_i \cdot (x_i - \widetilde{x}_i)$ can be interpreted as the contribution of feature i to the prediction. Specifically, an inspection of the summands reveals that a feature will be attributed strong relevance if the following two conditions are met: (1) the feature must be expressed in the data, i.e. it differs from the reference value $\widetilde{x}_i$, and (2) the model output should be sensitive to the presence of that feature, i.e. $[\nabla f(\widetilde{x})]_i \neq 0$. An explanation for the prediction can then be formed by the vector of relevance scores $(R_i)_i$. It can be given to the user as a histogram over the input features or as a heatmap.
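As a minimal sketch of this first-order analysis, the snippet below uses a hypothetical affine model, for which the expansion around a root point is exact. The weights, the input, and the choice of root point (the orthogonal projection of x onto the set where the model output is zero) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def f(x, w, b):
    """Hypothetical affine model standing in for the learned function."""
    return float(w @ x + b)

def nearest_root_point(x, w, b):
    """Orthogonal projection of x onto the hyperplane f = 0 (one possible root point)."""
    return x - (f(x, w, b) / (w @ w)) * w

def taylor_relevance(x, w, b):
    """R_i = [grad f(x_root)]_i * (x_i - x_root_i); for an affine model the gradient is w."""
    x_root = nearest_root_point(x, w, b)
    return w * (x - x_root)

# Illustrative values only.
w = np.array([2.0, -1.0, 0.5])
b = -0.5
x = np.array([1.0, 0.5, 2.0])

R = taylor_relevance(x, w, b)
print("relevance scores:", R)
print("sum of scores:", R.sum(), "prediction:", f(x, w, b))
```

For this affine case the scores sum to the prediction exactly; for a nonlinear model the sum is only an approximation, and its quality depends on how the root point is chosen.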

For illustration, consider the problem of explaining a prediction for a data point from the Concrete Compressive Strength dataset [197]. For this data point, a simple two-layer neural network model predicts a low compressive strength. Applying the analysis above gives an explanation for this prediction, which we show in Fig. 2.

Fig. 2. Input example predicted to have low compressive strength, and a feature-wise explanation of the prediction. Red and blue color indicate positive and negative contributions.

For this example, low cement concentration and below-average age are factors of low compressive strength, although this is partly compensated by a high quantity of blast furnace slag.

Furthermore, for an explanation to be interpretable by its receiver, the latter must be able to make sense of the input features. Some features such as ‘cement’, ‘water’, and ‘age’ are understandable to everyone. However, more technical terms such as ‘blast furnace slag’ or ‘superplasticizer’ may only be accessible to a domain expert. Therefore, when using these explanation techniques, we make the implicit assumption that those input features are interpretable to the receiver.

B. Deep Networks and the Difficulty of Explaining Them

In practice, linear models or shallow neural networks may not be sufficiently expressive to predict the task optimally. Deep neural networks (DNNs) have been proposed as a way of producing more predictive models. They can be abstracted as a sequence of layers

$$f(x) = f_L \circ \dots \circ f_1(x),$$

where each layer applies a linear transformation followed by an element-wise nonlinearity. Combining a large number of these layers endows the model with high prediction power. DNNs have proven especially successful on computer vision tasks [99], [175], [65]. However, DNN models are also much more complex and nonlinear, and quantities entering into the simple explanation model of Eq. (1) become considerably harder to compute and to estimate reliably.

A first difficulty comes from the multiscale and distributed nature of neural network representations. Some neurons are activated for only a few data points, whereas others apply more globally. The prediction is thus a sum of local and global effects, which makes it difficult (or impossible) to find a root point that linearly expands to the prediction for the data point of interest. The transition from the global to local effect indeed introduces a nonlinearity, which Eq. (1) cannot capture.

A second source of instability arises from the high depth of recent neural networks, where a ‘shattered gradient’ effect was observed [16]: the gradient locally resembles white noise. In particular, it can be shown that for deep rectifier networks, the number of discontinuities of the gradient can grow in the worst case exponentially with depth [129]. The shattered gradient effect is illustrated in Fig. 3 (left) for the well-established VGG-16 network [175]: The network is fed multiple consecutive video frames of an athlete lifting a barbell, and we observe the prediction for the output neuron ‘barbell’. The gradient of the prediction changes its value much more quickly than the prediction itself. An explanation based on such a gradient would therefore inherit this noise.

Fig. 3. Two difficulties encountered when explaining DNNs. Left: Shattered gradient effect causing gradients to be highly varying and too noisy to be used for explanation. Right: Pathological minima in the function, making it difficult to search for meaningful reference points.

A last difficulty comes from the challenge of searching for a root point on which to base the explanation that is both close to the data and not an adversarial example [53], [135]. The problem is illustrated in Fig. 3 (right), where we showcase a reference point that does not carry any meaningful visual difference to the original data x, but for which the function output has changed dramatically. The problem of adversarial examples can be explained by the gradient noise, which causes the model to ‘overreact’ to certain pixel-wise perturbations, and also by the high dimensionality of the data (50176 pixels for VGG-16 and the ImageNet data set), where many small pixel-wise effects accumulate into a large effect on the model output.

III. PRACTICAL METHODS FOR EXPLAINING DNNS

In view of the multiple challenges posed by analyzing deep neural network functions, building robust and practical methods to explain their decisions has developed into a research area of its own [128], [59], [159], and an abundance of methods has been proposed. In parallel, efficient software (cf. Appendix C for a list) makes these newly developed methods readily usable in practice, and allows researchers to perform systematic comparisons between them on reference models and datasets.

In this section, we focus on four families of explanation techniques: Interpretable Local Surrogates, Occlusion Analysis, Gradient-based techniques, and Layer-Wise Relevance Propagation. In our view, these techniques exemplify the current diversity of possible approaches to explaining predictions in terms of input features, and taken together provide a broad coverage of the types of models to explain and the practical use cases. We give references to further related methods in the corresponding subsections. Table III in Appendix C provides a glossary of all referenced methods.

A. Interpretable Local Surrogates [147]

This category of methods aims to replace the decision function by a local surrogate model that is structured in such a way that it is self-explanatory (an example of a self-explanatory model is the linear model). This approach is embodied in the LIME algorithm [147], which was successfully applied to DNN classifiers for images and text. Explanation can be achieved by first defining some local distribution $p_x(\xi)$ around our data point x, learning the parameter v of the linear model that best matches the function locally:

$$\min_v \int \big(f(\xi) - v^\top \xi\big)^2 \, dp_x(\xi),$$

and then extracting local feature contributions, e.g. $R_i = v_i x_i$. Because the method does not rely on the gradient of the original DNN model, it avoids some of the difficulties discussed in Section II-B. The LIME method also covers the incorporation of sparsity or simple decision trees into the surrogate model to further enhance interpretability. Additionally, the learned surrogate model may be based on its own set of interpretable features, which makes it possible to produce explanations in terms of features that are maximally interpretable to the human. Other methods that explain by building a local surrogate include LORE [58] and Anchors [148]. Furthermore, a broader set of methods does not consider a specific location in the input space and instead builds a global surrogate model of the decision function, where the surrogate model readily incorporates interpretability structures. We discuss these global ‘self-explainable’ models in Section III-E.
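The local-surrogate idea can be sketched in a few lines. The snippet below is not the LIME library or its exact procedure: it assumes a Gaussian local distribution around x, fits a linear model without intercept by ordinary least squares, and reads off contributions as the product of fitted weight and feature value. The function name `local_linear_surrogate`, the toy function `f_demo`, and all parameter values are hypothetical.

```python
import numpy as np

def local_linear_surrogate(f, x, sigma=0.1, n_samples=500, seed=0):
    """Fit v so that v @ xi approximates f(xi) for samples xi drawn around x.

    Assumption: the local distribution is an isotropic Gaussian centered at x.
    Returns the fitted weights v and the feature contributions v * x.
    """
    rng = np.random.default_rng(seed)
    xi = x + sigma * rng.standard_normal((n_samples, x.size))  # local samples
    y = np.array([f(p) for p in xi])                           # model outputs
    v, *_ = np.linalg.lstsq(xi, y, rcond=None)                 # least-squares fit
    return v, v * x

# Toy nonlinear function standing in for a trained model (illustrative only).
f_demo = lambda p: float(np.tanh(p[0]) + 0.5 * p[1])
x = np.array([0.2, 1.0])

v, R = local_linear_surrogate(f_demo, x)
print("surrogate weights:", v)
print("feature contributions:", R)
```

Only function evaluations are used, so the gradient of the model is never needed. The price is the number of evaluations, which grows with the number of samples drawn.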

B. Occlusion Analysis [201]

Occlusion Analysis is a particular type of perturbation analysis where we repeatedly test the effect on the neural network output of occluding patches or individual features in the input image [201], [208]:

$$R_i = f(x) - f(x \odot (1 - m_i)),$$

where $m_i$ is an indicator vector for the patch or feature to remove, and ‘$\odot$’ denotes the element-wise product. A heatmap can be built from these scores highlighting locations where the occlusion has caused the strongest decrease of the function. Because occlusion may produce visual artefacts, inpainting occluded patterns (e.g. using a generative model [2]) rather than setting them to gray was proposed as an enhancement.
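A minimal sketch of patch-wise occlusion on a 2-D input follows. It assumes that occluded pixels are set to a constant fill value (zero by default, which corresponds to multiplying by one minus the indicator) and that scores of overlapping patches are averaged per pixel; both choices, as well as the toy function `f_toy`, are illustrative assumptions.

```python
import numpy as np

def occlusion_scores(f, x, patch=2, stride=2, fill=0.0):
    """Per-pixel score map: drop in f when the patch covering a pixel is occluded."""
    h, w = x.shape
    fx = f(x)
    scores = np.zeros((h, w), dtype=float)
    counts = np.zeros((h, w), dtype=float)
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = x.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = fx - f(occluded)              # f(x) - f(x with patch removed)
            scores[top:top + patch, left:left + patch] += drop
            counts[top:top + patch, left:left + patch] += 1
    return scores / np.maximum(counts, 1)        # average over overlapping patches

# Toy "model" that only looks at the top-left 2x2 block (illustrative only).
f_toy = lambda img: float(img[:2, :2].sum())
heat = occlusion_scores(f_toy, np.ones((4, 4)))
print(heat)
```

The number of function evaluations equals the number of patches, so smaller strides give finer heatmaps at a proportionally higher cost.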

Attribution based on Shapley values [116], [179] (see Section V-A for a definition) can also be seen as an occlusion analysis. Here, instead of occluding features one at a time, a much broader set of occlusion patterns is considered, which has the effect of also integrating global effects in the explanation. SHAP and Kernel SHAP [116] are practical algorithms to approximate Shapley values: they sample a few occlusions according to the probability distribution used to compute Shapley values, and then fit a linear surrogate model that correctly predicts the effect of these occlusions on the output. An explanation can then be easily extracted, and this explanation retains some similarity with the original Shapley values.
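For a handful of features, Shapley values can be computed exactly by enumerating all occlusion patterns. The sketch below uses the standard Shapley formula and assumes that an absent feature is replaced by a fixed baseline value; the toy function, the baseline, and the helper names are hypothetical. It is not the SHAP or Kernel SHAP algorithm, which approximate this quantity by sampling.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values; cost grows exponentially with the number of features."""
    d = x.size

    def value(subset):
        # Features in `subset` keep their value, all others are set to the baseline.
        z = baseline.astype(float).copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return f(z)

    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

# Toy function with an interaction between features 0 and 1 (illustrative only).
f_toy = lambda z: float(z[0] * z[1] + z[2])
x = np.array([1.0, 2.0, 3.0])
phi = shapley_values(f_toy, x, baseline=np.zeros(3))
print(phi)  # → [1. 1. 3.]
```

The interaction term is split equally between the two features involved, and the scores sum to the difference between the prediction and the baseline output.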

A further extension of occlusion analysis is Meaningful Perturbation [47], where an occluding pattern is synthesized, subject to a sparsity constraint, in order to engender the maximum drop of the function value f. The explanation is then readily given by the synthesized pattern. The perturbation-based approach was later embedded in a rate-distortion theoretical framework [118].

C. Integrated Gradients / SmoothGrad [183], [176]

Integrated Gradients [183] explains by integrating the gradient along some trajectory in input space connecting some root point $\widetilde{x}$ to the data point x. The integration process addresses the problem of locality of the gradient information (cf. Section II-B), making it well-suited for explaining functions that have multiple scales. In the simplest form, the trajectory is chosen to be the segment connecting some root point to the data. Integrated gradients defines feature-wise scores as:

$$R_i(x) = (x_i - \widetilde{x}_i) \cdot \int_0^1 [\nabla f(\widetilde{x} + t \cdot (x - \widetilde{x}))]_i \, dt.$$

It can be shown that these scores satisfy $\sum_i R_i(x) = f(x)$ and thus constitute a complete explanation. If necessary, the method can be easily extended to any trajectories in input space. For implementation purposes, integrated gradients must be discretized. Specifically, the continuous trajectory is approximated by a sequence of data points $x^{(1)}, \dots, x^{(N)}$. Integrated gradients is then implemented as shown in Algorithm 1.

The gradient can easily be computed using automatic differentiation. The larger the number of discretization steps, the closer the output gets to the integral form, but the more computationally expensive the procedure gets.
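The discretization can be sketched as a Riemann sum along the straight segment from the root point to the data point. The snippet below is an illustration under stated assumptions rather than a reproduction of the paper's algorithm box: the gradient is supplied as a hypothetical callable `grad_f` (written by hand here instead of obtained by automatic differentiation), and the toy function and root point are chosen so that the exact result is known.

```python
import numpy as np

def integrated_gradients(grad_f, x, x_root, n_steps=100):
    """Accumulate grad f(x_n) * (x_{n+1} - x_n) along the segment x_root -> x."""
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    path = x_root + ts[:, None] * (x - x_root)   # discretized trajectory
    R = np.zeros(x.shape, dtype=float)
    for n in range(n_steps):
        R += grad_f(path[n]) * (path[n + 1] - path[n])
    return R

# Toy function f(x) = (w @ x)^2 with hand-written gradient (illustrative only).
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float((w @ x) ** 2)
grad_f = lambda x: 2.0 * (w @ x) * w

x = np.array([1.0, 0.5, 2.0])
x_root = np.zeros(3)                              # f(x_root) = 0

R = integrated_gradients(grad_f, x, x_root, n_steps=1000)
print("scores:", R)
print("sum of scores:", R.sum(), "prediction:", f(x))
```

With 1000 steps the scores sum to the prediction up to a discretization error of roughly 0.1 percent for this function; fewer steps trade accuracy for fewer gradient evaluations.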

Another popular gradient-based explanation method is SmoothGrad [176]. The function’s gradient is averaged over a large number of locations corresponding to small random perturbations of the original data point x:

$$R(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)} [\nabla f(x + \varepsilon)].$$

As the method’s name suggests, the averaging process ‘smooths’ the explanation, and in turn also addresses the shattered gradient problem described in Section II-B. (See also [130], [14], [174] for earlier gradient-based explanation techniques.)
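The averaging step can be sketched as a Monte Carlo estimate over Gaussian perturbations. In the snippet below, the gradient callable `grad_noisy`, the noise level, and the sample count are illustrative assumptions; the toy gradient adds a fast oscillation to a smooth trend in order to mimic, very loosely, a noisy gradient.

```python
import numpy as np

def smoothgrad(grad_f, x, sigma=0.1, n_samples=50, seed=0):
    """Average the gradient over Gaussian perturbations of x."""
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal((n_samples, x.size))
    grads = np.stack([grad_f(x + eps) for eps in noise])
    return grads.mean(axis=0)

# Toy gradient: smooth trend 2x plus a rapidly oscillating term (illustrative only).
grad_noisy = lambda x: 2.0 * x + np.cos(100.0 * x)

x = np.array([0.3, -0.7, 1.1])
raw = grad_noisy(x)
smoothed = smoothgrad(grad_noisy, x, sigma=0.1, n_samples=2000)
print("raw gradient:     ", raw)
print("smoothed gradient:", smoothed)
```

The oscillating term averages out while the smooth trend is kept, at the cost of one gradient evaluation per sample.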

In Section IV, we experiment with a combination of Integrated Gradients and SmoothGrad [176], similar to Expected Gradients (cf. [181]), where relevance scores obtained from Integrated Gradients are averaged over several integration paths that are drawn from some random distribution. The resulting method preserves the advantages of Integrated Gradients and further reduces the gradient noise.

D. Layer-Wise Relevance Propagation (LRP) [13]

The Layer-wise Relevance Propagation (LRP) method [13] makes explicit use of the layered structure of the neural network and operates in an iterative manner to produce the explanation. Consider the neural network

$$f(x) = f_L \circ \dots \circ f_1(x).$$

First, activations at each layer of the neural network are computed until we reach the output layer. The activation score in the output layer forms the prediction. Then, a reverse propagation pass is applied, where the output score is progressively redistributed, layer after layer, until the input variables are reached. The redistribution process follows a conservation principle analogous to Kirchhoff’s laws in electrical circuits. Specifically, all ‘relevance’ that flows into a neuron at a given layer flows out towards the neurons of the layer below. At a high level, the LRP procedure can be implemented as a forward-backward loop, as shown in Algorithm 2.

The function relprop performs redistribution from one layer to the layer below and is based on ‘propagation rules’ defining the exact redistribution policy. Examples of propagation rules are given later in this section, and their implementation is provided in Appendix B. The LRP procedure is shown graphically in Fig. 4.

Fig. 4. Illustration of the LRP propagation procedure applied to a neural network. The prediction at the output is propagated backward in the network, using various propagation rules, until the input features are reached. The propagation flow is shown in red.

While LRP can in principle be performed in any forward computational graph, a class of neural networks which is often encountered in practice, and for which LRP comes with efficient propagation rules that can be theoretically justified (cf. Section V), is deep rectifier networks [51]. The latter can be in large part abstracted as an interconnection of neurons of the type:

$$a_k = \max\Big(0, \sum_{0,j} a_j w_{jk}\Big),$$

where $a_j$ denotes some input activation, and $w_{jk}$ is the weight connecting neuron j to neuron k in the layer above. The notation $\sum_{0,j}$ indicates that we sum over all neurons j in the lower layer plus a bias term $w_{0k}$ with $a_0 = 1$. For this class of networks, various propagation rules have been proposed (cf. Fig. 4). For example, the LRP-$\gamma$ rule [126], defined as

$$R_j = \sum_k \frac{a_j (w_{jk} + \gamma w_{jk}^+)}{\sum_{0,j} a_j (w_{jk} + \gamma w_{jk}^+)} R_k \qquad (2)$$

with $w_{jk}^+ = \max(0, w_{jk})$, redistributes based on the contribution of lower-layer neurons to the given neuron activation, with a preference for positive contributions over negative contributions. This makes it particularly robust and suitable for the lower-layer convolutions. Other propagation rules such as LRP-$\epsilon$ or LRP-0 are suitable for other layers [126]. Additional propagation rules have been proposed for special layers such as min/max pooling [13], [127], [86] and LSTM blocks [11], [9]. Furthermore, a number of other propagation techniques have been proposed [171], [170], [100], [202], with some of the rules overlapping with LRP for certain choices of parameters. For a technical overview of LRP including a discussion of the various propagation rules and further recent heuristics, see [126].

An inspection of Eq. (2) shows an important property of LRP, that of conserving relevance from layer to layer. In particular, we can show that in absence of bias terms, $\sum_j R_j = \sum_k R_k$. A further interesting property of this propagation rule is ‘smoothing’: Consider that the relevance can be written as $R_j = a_j c_j$ and $R_k = a_k c_k$, i.e. as a product of activations and factors. Those factors can be directly related by the equation

$$c_j = \sum_k (w_{jk} + \gamma w_{jk}^+) \cdot \frac{\max\big(0, \sum_{0,j} a_j w_{jk}\big)}{\sum_{0,j} a_j (w_{jk} + \gamma w_{jk}^+)} \cdot c_k, \qquad (3)$$

with $\gamma$ and $w_{jk}^+$ as in Eq. (2). This equation can be interpreted as a smooth variant of the chain rule for derivatives used for computing the neural network gradient [125]. Thus, analogous to SmoothGrad [176], LRP also performs some gradient smoothing; however, it embeds it tightly into the deep architecture, so that only a single backward pass is required. In addition to smoothing, Eq. (3) can also be interpreted as a gradient that has been biased to positive values, an idea also found in methods such as DeconvNet [201] or Guided Backprop [178]. This modified gradient view on LRP can also be leveraged to achieve a simpler and more general implementation of LRP based on ‘forward hooks’, which we describe in the second part of Appendix B, and which we use to apply LRP on VGG-16 [175] and ResNet-50 [65] in Section IV.
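To make the forward-backward loop and the conservation property concrete, the following sketch applies a gamma-type rule to a tiny bias-free rectifier network. It is an illustration under assumptions, not the implementation from Appendix B: the weights and input are made up, the input is assumed non-negative, the same rule is used in every layer, and the function names `forward`, `relprop_gamma`, and `lrp` are hypothetical.

```python
import numpy as np

def forward(x, weights):
    """Forward pass of a bias-free ReLU network; returns activations of all layers."""
    activations = [x]
    for W in weights:
        activations.append(np.maximum(0.0, activations[-1] @ W))
    return activations

def relprop_gamma(a, W, R_upper, gamma=0.25):
    """Redistribute R_upper to the lower layer using weights w + gamma * max(0, w)."""
    Wg = W + gamma * np.maximum(0.0, W)
    z = a @ Wg                                    # denominator per upper neuron
    s = np.divide(R_upper, z, out=np.zeros_like(z), where=z > 0)
    return a * (Wg @ s)

def lrp(x, weights, gamma=0.25):
    """Forward pass, then layer-by-layer backward redistribution of the output."""
    activations = forward(x, weights)
    R = activations[-1]
    for a, W in zip(activations[-2::-1], weights[::-1]):
        R = relprop_gamma(a, W, R, gamma)
    return R

# Made-up weights and non-negative input (illustrative only).
W1 = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]])
W2 = np.array([[2.0], [-0.5]])
x = np.array([1.0, 2.0, 0.5])

output = forward(x, [W1, W2])[-1]
R = lrp(x, [W1, W2])
print("output:", output, "input relevance:", R, "sum:", R.sum())
```

Since there are no bias terms, the input relevance sums to the network output (1.25 here), which is a convenient sanity check when implementing propagation rules.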

E. Other Methods

We discuss in this section several other popular Explainable AI approaches, that either do not fall in the category of post-hoc explanation approaches (and therefore are not covered in the sections above), that are specialized for a particular neural network architecture, or that make use of different units of interpretability than the input features.

In contrast to the discussed post-hoc methods that apply to any DNN model, self-explainable models are designed from scratch with interpretability in mind. A self-explainable model can either be trained to solve a machine learning task directly from a supervised dataset, or it can be used to approximate a black-box model on some representative input distribution. Examples of self-explainable models include simple linear models, or specific nonlinear models, e.g. neural networks with an explicit top-level sum-pooling structure [143], [111], [28], [206], [25]. In all of these models, each summand is linked only to one or a few input variables, which makes attribution of their prediction on the input variables straightforward. More complex architectures involving attention mechanisms were also proposed [105], [15], [195], and inspection of the attention mechanism itself can also deliver useful insights into the model prediction. While self-explainable models can be useful for many real-world tasks (a list of arguments in favor of these models can be found e.g. in [154]), their applicability becomes more limited when the goal is to explain the strategy of some existing black-box model. In such a scenario, one would have to achieve the difficult task of closely replicating the black-box model for every possible input and perturbation of it, while at the same time being constrained by the predefined interpretable structure.

Other methods are specialized for a particular deep neural network model for which generic explanation methods do not provide a direct solution. One such model is the graph neural network [160], [91], where the graph adjacency matrix given as input does not appear only in the first layer, as is usually the case for inputs, but instead at every layer. Methods that have been proposed to explain these particular neural networks include the GNNExplainer [198] and GNN-LRP [162]. Other neural network architectures have a more conventional structure but still require a non-trivial adaptation of existing explanation methods. For example, extensions of LRP have been proposed to deal with the special LSTM blocks in recurrent neural networks [11], [9] or to handle attention units in the context of neural machine translation [38].

Further methods do not seek to explain in terms of input features, but in terms of the latent space, where the directions in the latent space code for higher-level concepts, such as color, material, object part, or object [205], [17], [18]. In particular, the TCAV method [89] produces a latent-space explanation for every individual prediction. Some techniques integrate multiple levels of abstraction (e.g. different layers of the neural networks) to arrive at a more informative explanation of the prediction process [203], [173], [162]. Finally, generative approaches have been proposed to build structured textual explanations of a machine learning model [67], [113].

IV. COMPARING EXPLANATION METHODS

The methods presented in Section III highlight the variety of approaches available for attributing the prediction of a deep neural network to its input features. This variety of techniques also translates into a variety of qualities of explanations. Illustrative examples of images and the explanation of predicted evidence for the ground truth class as produced by the different explanation methods are shown in Fig. 5. Occlusion Analysis is performed by occluding patches of size pixels with stride 16. Integrated Gradients performs 5 integration steps starting from 5 random points near the origin in order to add smoothing (cf. Appendix A), resulting in 25 function evaluations. LRP explanations are obtained by applying the same LRP rules as in [126]. We observe the following qualitative properties of the explanations: Occlusion-based explanations are coarse and are indicating relevant regions rather than the relevant pixel features. Integrated Gradients produces very fine pixel-wise explanations containing both substantial amounts of evidence in favor and against the prediction (red and blue pixels). LRP preserves the fine explanation structure but tends to produce less negative scores and attributes relevance to whole features rather than individual pixels.

Fig. 5. Examples of images from ImageNet [157] with classes ‘space bar’, ‘beacon/lighthouse’, ‘snow mobile’, ‘viaduct’, ‘greater swiss mountain dog’. Images are correctly predicted by the VGG-16 [175] neural network, and shown along with an explanation of the predictions. Different explanation methods lead to different qualities of explanation.

In practice, it is important to reach an objective assessment of how good an explanation is. Unfortunately, evaluating explanations is made difficult by the fact that it is generally impossible to collect ‘ground truth’ explanations. Building such ground truth explanations would indeed require the expert to understand how the deep neural network decides.

Standard machine learning models are usually evaluated by the utility (expected risk) of their decision behavior (e.g. [190]). Transposing this concept of maximizing utility to the domain of explanation, quantifying the utility of an explanation would first require defining the ultimate target task (the explanation being only an intermediate step), and then assessing by how much the use of the explanation by the human increases their performance on the target task, compared to not using it (see e.g. [14], [62], [41], [159]). Because such end-to-end evaluation schemes are hard to set up in practice, general desiderata for ML explanations have been proposed [184], [123]. Common ones include (1) faithfulness / sufficiency, (2) human-interpretability, and (3) the possibility of practically applying the explanation method to an ML model or an ML task (e.g. algorithmic efficiency of the explanation algorithm).

A. Faithfulness / Sufficiency

A first desideratum of an explanation is to reliably and comprehensively represent the local decision structure of the analyzed ML model. A practical technique to assess this property of an explanation is ‘pixel-flipping’ [158]. The pixel-flipping procedure tests whether removing the features highlighted by the explanation (as most relevant) leads to a strong decay of the network’s prediction abilities. The procedure is summarized in Algorithm 3.

Pixel-flipping runs from the most to the least relevant input features, iteratively removing them and monitoring the evolution of the neural network output. The series of recorded decaying prediction scores can be plotted as a curve. The faster the curve decreases, the more faithful the explanation method is w.r.t. the decision of the neural network. The pixel-flipping curve can be computed for a single example, or averaged over a whole dataset in order to get a global estimate of the faithfulness of an explanation algorithm under study.
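The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Algorithm 3: the names `pixel_flipping_curve`, `area_under_curve` and `toy_model` are hypothetical, the "network" is a toy ReLU unit, and removed features are simply set to a constant rather than inpainted.

```python
from typing import Callable, List, Sequence


def pixel_flipping_curve(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    relevance: Sequence[float],
    fill_value: float = 0.0,
) -> List[float]:
    """Return the model scores after removing 0, 1, ..., d features."""
    # Visit features from the most to the least relevant.
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    current = list(x)
    scores = [predict(current)]
    for i in order:
        current[i] = fill_value  # assumed imputation: a constant, not inpainting
        scores.append(predict(current))
    return scores


def area_under_curve(scores: Sequence[float]) -> float:
    """Mean score along the curve; a smaller value means a faster decay."""
    return sum(scores) / len(scores)


# Toy stand-in for a network: a ReLU applied to a weighted sum.
weights = [3.0, 0.5, 2.0, 0.1]


def toy_model(v: Sequence[float]) -> float:
    return max(0.0, sum(w * vi for w, vi in zip(weights, v)))


x = [1.0, 1.0, 1.0, 1.0]
good = [w * xi for w, xi in zip(weights, x)]  # ranking that matches the model
bad = [-r for r in good]                      # deliberately reversed ranking

curve_good = pixel_flipping_curve(toy_model, x, good)
curve_bad = pixel_flipping_curve(toy_model, x, bad)
```

In this toy setting the curve obtained from the matching ranking drops faster than the one obtained from the reversed ranking, which is the behavior the pixel-flipping test is designed to reward.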

Fig. 6 applies pixel-flipping to the three considered explanation methods and on two models: VGG-16 [175] and ResNet-50 [65]. At each step of pixel-flipping, removed pixels are imputed using a simple inpainting algorithm, which avoids introducing visual artefacts in the image.

We observe that for all explanation methods, removing relevant features quickly destroys class evidence. In particular, they perform much better than a random explanation baseline.

Fig. 6. Pixel-flipping experiment for testing faithfulness of the explanation. We remove pixels found to be the most relevant by each explanation method and verify how quickly the output of the network decreases.

Fine differences can however be observed between the methods: For example, LRP performs better on VGG-16 than on ResNet-50. This can be explained by VGG-16 having a more explicit structure (standard pooling operations for VGG-16 vs. strided convolution for ResNet-50), which better supports the process of relevance propagation (see also [149] for a discussion of the effect of structure on the performance of explanation methods).

A second observation in Fig. 6 is that Integrated Gradients has by far the highest decay rate initially but stagnates in the later phase of the pixel-flipping procedure. The reason for this effect is that IG focuses on the pixels to which the network is most sensitive, without, however, comprehensively identifying the relevant pattern in the image. This effect is illustrated in Fig. 6 (middle) on a zoomed-in exemplary image of class ‘greater swiss mountain dog’, where the image after 10% flipping has lost most of its prediction score but visually appears almost intact. Effectively, IG has built an adversarial example [185], [135], i.e. an example whose visual content clearly disagrees with the prediction at the output of the network. We note that Occlusion and LRP do not run into such adversarial examples. For these methods, pixel-flipping steadily and comprehensively removes features until class evidence has totally disappeared.

Overall, the pixel-flipping algorithm characterizes various aspects of the faithfulness of an explanation method. We note however that faithfulness of an explanation does not tell us how easy it will be for a human to make sense of that explanation. We address this other key requirement of an explanation in the following section.

B. Human Interpretability

Here, we discuss whether the presented explanation techniques deliver results that are meaningful to the human, i.e. whether the human can gain insight into the classifier’s decision strategy from the explanation. Human interpretability is hard to define in general [123]. Different users may have different capabilities at reading explanations and at making sense of the features that support them [147], [133]. For example, the layman may wish for a visual interpretation, even an approximate one, whereas the expert may prefer an explanation supported by a larger vocabulary, including precise scientific or technical terms [14].

For the image classification setting, interpretability can be quantified in terms of the amount of information contained in the heatmap (e.g. as measured by the file size). An explanation with a small associated file size is more likely to be interpretable by a human. The table below shows average file sizes (in bytes) associated with the various explanation techniques and for two neural networks.

We observe that occlusion produces the lowest file size and is therefore the most ‘interpretable’. It indeed only presents to the user rough localization information without going into the details of which exact feature has supported the decision as done e.g. by LRP. On the other side of the interpretability spectrum we find Integrated Gradients. In the explanations this last method produces, every single pixel contains information, and this makes it clearly overwhelming to the human.
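A file-size proxy of this kind can be sketched as follows. The storage format used for the measurements is not stated in this section, so the sketch assumes 8-bit quantization followed by DEFLATE compression (Python's `zlib`); the function name `heatmap_file_size` and the two synthetic heatmaps are hypothetical.

```python
import random
import zlib
from typing import List


def heatmap_file_size(heatmap: List[List[float]]) -> int:
    """Bytes needed to store a heatmap after 8-bit quantization and DEFLATE."""
    flat = [v for row in heatmap for v in row]
    peak = max(abs(v) for v in flat) or 1.0  # avoid dividing by zero
    # Map scores from [-peak, peak] to integers in [0, 255].
    quantized = bytes(int(round(127.5 * (v / peak + 1.0))) for v in flat)
    return len(zlib.compress(quantized, 9))


rng = random.Random(0)
# Coarse heatmap: constant 8x8 blocks, as an occlusion-like explanation.
coarse = [[float((r // 8 + c // 8) % 2) for c in range(32)] for r in range(32)]
# Fine heatmap: independent positive and negative scores at every pixel.
fine = [[rng.uniform(-1.0, 1.0) for _ in range(32)] for _ in range(32)]

size_coarse = heatmap_file_size(coarse)
size_fine = heatmap_file_size(fine)
```

Under these assumptions the block-structured heatmap compresses to far fewer bytes than the pixel-wise one, mirroring the ordering between coarse and fine explanations discussed above.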

In practice, neural networks do not need to be explained in terms of input features. For example, the TCAV method [89] considers directional derivatives in the space of activations (where the directions correspond to higher-level human-interpretable concepts) in place of the input gradient. Similar higher-level interpretations are also possible using the Occlusion and LRP methods, respectively by perturbing groups of activations that correspond, at a given layer, to a certain concept, or by stopping the LRP procedure at that layer and pooling scores over the group of neurons representing the desired concept.

C. Applicability and Runtime

Faithfulness and interpretability do not fully characterize the overall usefulness of an explanation method. To characterize usefulness, we also need to determine whether the explanation method is applicable to a range of models that is sufficiently large to include the neural network model of interest, and whether explanations can be obtained quickly with finite compute resources.

Occlusion-based explanations are the easiest to implement. These explanations can be obtained for any neural network, even those that are not differentiable. This also includes networks for which we do not have the source code and whose predictions we can only access through some online server. Technically, occlusion can therefore be used to understand the predictions of third-party models such as https://cloud.google.com/vision/ and https://www.clarifai.com/models. Integrated Gradients instead requires access to the neural network gradient for each prediction. Given that most machine learning models are differentiable, this method is widely applicable, including to neural networks with complex structures such as ResNets [65] or SqueezeNets [79]. Integrated Gradients is also easily implemented in state-of-the-art ML frameworks such as PyTorch or TensorFlow, where we can make use of automatic differentiation. LRP assumes that the model is structured as (or can be converted to [85], [86]) a neural network with a canonical sequence of layers, for example, an alternation of linear/convolution layers, ReLU layers, and pooling layers. This stronger requirement and the implementation overhead caused by explicitly accessing the different layers (cf. Appendix B) are, however, offset by a last characteristic we consider in this section: the computational cost associated with producing the explanation. A runtime comparison of the three explanation methods studied here is given in the table below (measured in explanations per second).

Occlusion is the slowest method as it requires reevaluating the function for each occluded patch. For image data, the number of patches, and hence the runtime of Occlusion, grows quadratically as the step size decreases, making high-resolution explanations with this method computationally prohibitive. Integrated Gradients inherits pixel-wise resolution from the gradient computation, whose cost is O(1) in the number of pixels, but requires multiple iterations for the integration. The runtime is further increased if performing an additional loop of smoothing. LRP is the fastest method in our benchmark by an order of magnitude. The LRP runtime is only approximately three times higher than that of computing a single forward pass. This makes LRP particularly convenient for the large-scale analyses we introduce in Section VI-D where an explanation needs to be produced for every single example in the dataset.
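The cost of Occlusion can be made concrete with a small sketch that treats the model as a black box. The function names, the toy model, and the choice of summing score drops over overlapping patches are assumptions made for illustration; the patch size and image size below are illustrative values only.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_map(predict: Callable[[Image], float], image: Image,
                  patch: int, stride: int, fill: float = 0.0) -> Image:
    """Assign the score drop of each occluded patch to the pixels it covers."""
    h, w = len(image), len(image[0])
    base = predict(image)
    relevance = [[0.0] * w for _ in range(h)]
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = [row[:] for row in image]
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    occluded[r][c] = fill
            drop = base - predict(occluded)  # one model query per patch
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    relevance[r][c] += drop  # assumed: overlaps are summed
    return relevance


def num_evaluations(size: int, patch: int, stride: int) -> int:
    """Model queries for a square image, excluding the unoccluded baseline."""
    per_axis = (size - patch) // stride + 1
    return per_axis ** 2


# Toy black box: only the top-left 2x2 corner of the image matters.
def toy_model(img: Image) -> float:
    return sum(img[r][c] for r in range(2) for c in range(2))


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(toy_model, image, patch=2, stride=2)

queries_stride_16 = num_evaluations(224, patch=16, stride=16)
queries_stride_8 = num_evaluations(224, patch=16, stride=8)
```

Only `predict` is ever called on the model, so the same loop applies to a model reachable through a remote API. Halving the stride roughly quadruples the number of queries, which is the quadratic growth referred to above.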

V. THEORETICAL FOUNDATIONS OF EXPLANATION METHODS

In parallel to developing explanation methods that address application requirements such as faithfulness, interpretability, usability and runtime, some works have focused on building theoretical foundations for the problem of explanation [127], [116] and establishing theoretical connections between the different methods [171], [4], [128].

Here, we present three frameworks: Shapley values [169], [179], [116], which come from game theory; Taylor expansions [13], [19]; and Deep Taylor Decomposition [127], which applies Taylor expansions repeatedly at each layer of a DNN. We then show how Occlusion, Integrated Gradients, and LRP coincide with these mathematical approaches for certain choices of parameters.

A. Shapley Values

Shapley values [169] are a framework originally proposed in the context of game theory to determine the individual contributions of a set of cooperating players $P$. The method considers every subset $S$ of cooperating players and tests the effect of removing/adding player $i$ to $S$ on the total payoff $v(S)$ obtained by $S$ if they cooperate. Specifically, Shapley values identify the contribution of player $i$ to the overall coalition $P$ to be:

$$\phi_i = \sum_{S \subseteq P \setminus \{i\}} \frac{|S|!\,(|P| - |S| - 1)!}{|P|!} \, \big[ v(S \cup \{i\}) - v(S) \big]$$

where each subset $S$ is weighted by the factor $|S|!\,(|P| - |S| - 1)!/|P|!$. Shapley values satisfy a number of axioms, in particular, efficiency ($\sum_{i \in P} \phi_i = v(P) - v(\emptyset)$), symmetry, linearity, and zero added value of a dummy player. Shapley values are in fact the unique assignment strategy that jointly satisfies these axioms [169].

When transposing the method to the task of explaining a machine learning model [179], [116], the players of the cooperating game become the input features, and the payoff function becomes related to the DNN output. In [116], the payoff function is chosen to be the conditional expectation: $v(S) = \mathbb{E}[f(x) \mid x_S]$. Alternately, to make the score depend only on the model without assuming a specific input distribution, the payoff function can be set to

$$v(S) = f(x_S) \tag{4}$$

where $x_S$ denotes the input $x$ in which features not in $S$ are set to zero (see e.g. [5]), and we will use this formulation to make connections to practical explanation methods in Section V-D. Note that Shapley values make almost no assumptions about the structure of the function $f$ and can therefore serve as a general theoretical framework to analyze explanation methods. For specific functions, e.g. additive models of the type $f(x) = \sum_i f_i(x_i)$, Shapley values (using Eq. (4)) take the simple form $\phi_i = f_i(x_i) - f_i(0)$.
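As a sanity check of the additive case, the following sketch computes exact Shapley values by enumerating all coalitions, using the payoff in which features outside the coalition are set to zero. The function name `shapley_values` and the toy additive model are hypothetical; the enumeration is exponential in the number of features and only meant for small examples.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(f: Callable[[Sequence[float]], float],
                   x: Sequence[float]) -> List[float]:
    """Exact Shapley values with a zero-replacement payoff function."""
    n = len(x)

    def payoff(coalition) -> float:
        # Features outside the coalition are set to zero.
        return f([x[j] if j in coalition else 0.0 for j in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|P| - |S| - 1)! / |P|! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (payoff(s | {i}) - payoff(s))
        phi.append(total)
    return phi


# Additive toy model: f(x) = f_1(x_1) + f_2(x_2) + f_3(x_3).
def additive(v: Sequence[float]) -> float:
    return v[0] ** 2 + 3.0 * v[1] + (v[2] + 1.0)


x = [2.0, 1.0, 4.0]
phi = shapley_values(additive, x)
expected = [2.0 ** 2 - 0.0, 3.0 * 1.0 - 0.0, (4.0 + 1.0) - 1.0]
```

Each score equals the change of the corresponding additive term between the input and zero, and the scores sum to the difference between the model output at `x` and at the zero input.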

B. Taylor Decomposition

Taylor expansions are a well-known mathematical framework to decompose a function into a series of terms associated with different degrees and combinations of input variables. Unlike Shapley values, which evaluate the function $f(x)$ multiple times, the Taylor expansion framework for explaining an ML model [13], [19], [42] evaluates the function once at some reference point $\tilde{x}$ and assigns feature contributions by locally extracting the gradient (and higher-order derivatives). Specifically, the Taylor expansion of some smooth and differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at some reference point $\tilde{x}$ is given by:

$$f(x) = f(\tilde{x}) + \sum_{i=1}^{d} [\nabla f(\tilde{x})]_i \, (x_i - \tilde{x}_i) + \frac{1}{2} \sum_{i=1}^{d} \sum_{i'=1}^{d} [\nabla^2 f(\tilde{x})]_{ii'} \, (x_i - \tilde{x}_i)(x_{i'} - \tilde{x}_{i'}) + \ldots$$

where $\nabla f$ and $\nabla^2 f$ denote the gradient and the Hessian respectively, and $\ldots$ denote the (non-expanded) higher-order terms. The zero-order term is the function value at the reference point and is zero if choosing a root point. There are as many first-order terms as there are dimensions and each of them is bound to a particular input variable. Thus, they offer a natural way of attributing a function value $f(x)$ onto individual linear components. There are as many second-order terms as there are pairs of ordered variables, and even more third-order and higher-order terms. When the function is approximately locally linear, second and higher-order terms can be ignored, and we get the following simple attribution scheme:

$$R_i = [\nabla f(\tilde{x})]_i \cdot (x_i - \tilde{x}_i)$$

a product of the gradient and the input relative to our root point. In the general case, there is no closed-form approach to find the root point and it is instead obtained using an optimization technique.
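To illustrate the first-order attribution and the terms it ignores, the sketch below evaluates the gradient at a root point by central finite differences. The function `f`, the chosen root point, and the helper names are hypothetical; an ML framework would obtain the gradient by automatic differentiation instead.

```python
from typing import Callable, List, Sequence


def numerical_gradient(f: Callable[[Sequence[float]], float],
                       point: Sequence[float], h: float = 1e-5) -> List[float]:
    """Central finite-difference estimate of the gradient of f at `point`."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def taylor_attribution(f: Callable[[Sequence[float]], float],
                       x: Sequence[float], root: Sequence[float]) -> List[float]:
    """First-order terms: gradient at the root times (input minus root)."""
    grad = numerical_gradient(f, root)
    return [g * (xi - ri) for g, xi, ri in zip(grad, x, root)]


def f(v: Sequence[float]) -> float:
    return v[0] ** 2 - v[1]


root = [1.0, 1.0]  # a root point: f(root) = 0
x = [1.1, 0.9]
scores = taylor_attribution(f, x, root)  # approximately [0.2, 0.1]
residual = f(x) - f(root) - sum(scores)  # what the linear terms leave out
```

Here the two scores account for about 0.3 of the function value 0.31; the remaining 0.01 is the second-order term, which shrinks as the input approaches the root point.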

C. Deep Taylor Decomposition

An alternate way of formalizing the problem of attributing a function onto input features is offered by the recent framework of Deep Taylor Decomposition (DTD) [127]. Deep Taylor Decomposition assumes the function is structured as a deep neural network and seeks to attribute the prediction onto input features by performing a Taylor decomposition at every neuron of each layer instead of directly on the whole neural network function. Deep Taylor Decomposition assumes the output score has already been attributed onto some layer of activations, with attribution scores denoted by $R_k$. Deep Taylor Decomposition then considers the function $R_k(a)$ where $a = (a_j)_j$ is the collection of neuron activations in the layer below. These quantities are illustrated in Fig. 7.

Fig. 7. Graphical illustration of the function $R_k(a)$ that DTD seeks to decompose on the input dimensions. Because $R_k(a)$ is complex, it is often replaced by an analytically more tractable model $\hat{R}_k(a)$ that only depends on local activations.

The function $R_k(a)$ is typically very complex as it corresponds to a composition of multiple forward and backward computations. This function can however be approximated locally by some ‘relevance model’ $\hat{R}_k(a)$, the choice of which will depend on the method we have used for computing $R_k$. We then compute a Taylor expansion of this function:

$$\hat{R}_k(a) = \hat{R}_k(\tilde{a}) + \sum_j [\nabla \hat{R}_k(\tilde{a})]_j \cdot (a_j - \tilde{a}_j) + \ldots$$

The linear terms define ‘messages’ $R_{j \leftarrow k}$ that can be redistributed to neurons in the lower layer, and messages received by a given neuron at a certain layer are summed to form a total relevance score:

$$R_j = \sum_k [\nabla \hat{R}_k(\tilde{a}^{(k)})]_j \cdot (a_j - \tilde{a}_j^{(k)}) \tag{5}$$

Here, we have added an index $(k)$ to the root point to make explicit that different root points can be used for expanding different neurons. The redistribution procedure is iterated from the top layer towards the lower layers, until the input features are reached.

D. Connections between Explanation Methods

Having described the Shapley values and the simple and deep Taylor decomposition frameworks, we now present some results from the literature showing how some explanation methods reduce for certain choices of parameters to these frameworks. The different connections we outline here are summarized in Table I.

TABLE I EXAMPLES OF EXPLANATION METHODS APPLIED WITH DIFFERENT PARAMETERS ON DIFFERENT MODELS, AND WHETHER THEY CAN BE EMBEDDED IN EACH OF THE THREE PRESENTED THEORETICAL FRAMEWORKS.

We start by connecting occlusion-based explanations of a linear model to Shapley values and Taylor decomposition.

Proposition 1. When applied to homogeneous linear models (of the type $f(x) = w^\top x$), occlusion with patch size 1 and replacement value 0 is equivalent to a Taylor decomposition with root point $\tilde{x} = 0$, as well as Shapley values with value function given by Eq. (4).

The first connection is shown by the chain of equations $f(x) - f(x_{P \setminus \{i\}}) = w_i x_i = [\nabla f(\tilde{x})]_i \cdot (x_i - \tilde{x}_i)$, where $x_{P \setminus \{i\}}$ denotes $x$ with feature $i$ set to zero. For the Shapley values, we simply observe that $v(S \cup \{i\}) - v(S) = w_i x_i$ for every subset $S$, which again gives the same result because the Shapley weights sum to one. Integrated Gradients and LRP-0 also yield the same result. Hence, for this simple linear model, all explanation methods behave consistently and in agreement with the existing theoretical frameworks. The connection of explanation methods to Shapley values for linear models was also made in [5].
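The agreement stated in Proposition 1 can be checked numerically on a small linear model. The weights and input below are arbitrary illustrative values, and the Shapley values are computed with the equivalent formulation that averages marginal contributions over all orderings of the features.

```python
from itertools import permutations
from typing import List, Sequence, Set

w = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]


def f(v: Sequence[float]) -> float:
    """Homogeneous linear model without bias."""
    return sum(wi * vi for wi, vi in zip(w, v))


def masked(keep: Set[int]) -> List[float]:
    """Copy of x in which features outside `keep` are set to zero."""
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


n = len(x)
everyone = set(range(n))

# Occlusion with patch size 1 and replacement value 0.
occlusion = [f(x) - f(masked(everyone - {i})) for i in range(n)]

# First-order Taylor terms at the root point 0 (the gradient of f is w).
taylor = [wi * (xi - 0.0) for wi, xi in zip(w, x)]

# Shapley values: average marginal contribution over all feature orderings.
shapley = [0.0] * n
orders = list(permutations(range(n)))
for order in orders:
    coalition: Set[int] = set()
    for i in order:
        gain = f(masked(coalition | {i})) - f(masked(coalition))
        shapley[i] += gain / len(orders)
        coalition.add(i)
```

All three lists contain the products of weight and input, here 2, -3 and 2, up to floating-point rounding.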

The connection between Integrated Gradients and Taylor decomposition holds for a broader class of neural network functions, specifically deep rectifier networks (without biases):

Proposition 2. When applied to deep rectifier networks of the type $f(x) = \max(0, W_L \max(0, \ldots \max(0, W_1 x)))$, Integrated Gradients with integration path $\{t x : 0 < t \leq 1\}$ is equivalent to Taylor decomposition at $\tilde{x} = \varepsilon x$ in the limit $\varepsilon \to 0$.

This can be shown by making the preliminary observation that a deep rectifier network is linear with constant gradient on the segment (0, x] and then applying the chain of equations $x_i \int_0^1 [\nabla f(t x)]_i \, dt = x_i \, [\nabla f(\varepsilon x)]_i$ for any $0 < \varepsilon \leq 1$, which coincides with the Taylor term $[\nabla f(\varepsilon x)]_i \cdot (x_i - \varepsilon x_i)$ as $\varepsilon \to 0$. This connection, along with the observation that a single gradient evaluation of a deep network can be noisy (cf. Section II-B), speaks against integrating on the segment (0, x]. For this reason, we have opted in the experiments of Section IV to use a smoothed version of IG. A further result shows an equivalence between a ‘naive’ version of LRP (using LRP-0 at every layer) and Taylor decomposition.
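The constant-gradient argument can be verified on a tiny bias-free ReLU network. The sketch below is illustrative only: the weights, the input, and the helper names are made up, the network has a single output neuron, and the integral is approximated by a Riemann sum on the segment (0, x].

```python
from typing import List, Sequence

Matrix = List[List[float]]


def forward(weights: Sequence[Matrix], x: Sequence[float]) -> List[List[float]]:
    """Activations of every layer of a bias-free ReLU network."""
    acts = [list(x)]
    for W in weights:  # W[k][j] connects input j to output k
        prev = acts[-1]
        acts.append([max(0.0, sum(wkj * aj for wkj, aj in zip(row, prev)))
                     for row in W])
    return acts


def gradient(weights: Sequence[Matrix], x: Sequence[float]) -> List[float]:
    """Gradient of the single output neuron with respect to the input."""
    acts = forward(weights, x)
    grad = [1.0]
    for W, out in zip(reversed(weights), reversed(acts[1:])):
        gated = [g if a > 0.0 else 0.0 for g, a in zip(grad, out)]  # ReLU gate
        grad = [sum(W[k][j] * gated[k] for k in range(len(W)))
                for j in range(len(W[0]))]
    return grad


def integrated_gradients(weights: Sequence[Matrix], x: Sequence[float],
                         steps: int = 20) -> List[float]:
    """Riemann-sum approximation of IG along the path t * x, 0 < t <= 1."""
    total = [0.0] * len(x)
    for s in range(1, steps + 1):
        t = s / steps
        g = gradient(weights, [t * xi for xi in x])
        total = [acc + gi / steps for acc, gi in zip(total, g)]
    return [xi * ti for xi, ti in zip(x, total)]


W1 = [[1.0, -2.0, 0.5], [-1.0, 1.0, 2.0], [0.5, 0.5, -1.0]]
W2 = [[1.0, 0.5, 2.0]]
x = [1.0, 0.5, 2.0]

ig = integrated_gradients([W1, W2], x)
grad_times_input = [g * xi for g, xi in zip(gradient([W1, W2], x), x)]
output = forward([W1, W2], x)[-1][0]
```

Because scaling the input by a positive factor does not change which ReLUs are active, every gradient along the path is identical, so the integrated scores match gradient times input and sum to the network output.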

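Proposition 3. For deep rectifier nets of the type $f(x) = \max(0, W_L \max(0, \ldots \max(0, W_1 x)))$, applying LRP-0 at each layer is equivalent to a Taylor decomposition at $\tilde{x} = \varepsilon x$ in the limit $\varepsilon \to 0$.

This result can be derived by taking the LRP formulation of Eq. (3) and specializing it to LRP-0. Since the denominator of the rule then equals the activation $a_k$ for every active neuron, this equation reduces to:

$$c_j = \sum_k w_{jk} \cdot 1_{a_k > 0} \cdot c_k$$

where $c_j = R_j / a_j$ and $c_k = R_k / a_k$. This equation is exactly the same as the one that propagates gradients in a deep rectifier network. Hence, the input relevance computed by LRP becomes $R_i = x_i \, [\nabla f(x)]_i$, for which we have already shown the equivalence to simple Taylor decomposition in the proposition above. The connection was originally made in [171].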
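The equivalence between layer-wise LRP-0 and gradient times input can also be checked numerically. The sketch below is a toy illustration with made-up weights and hypothetical helper names, written for a bias-free ReLU network with a single output neuron; it is not the implementation referred to in Appendix B.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def forward(weights: Sequence[Matrix], x: Sequence[float]) -> List[List[float]]:
    """Activations of every layer of a bias-free ReLU network."""
    acts = [list(x)]
    for W in weights:  # W[k][j] connects input j to output k
        prev = acts[-1]
        acts.append([max(0.0, sum(wkj * aj for wkj, aj in zip(row, prev)))
                     for row in W])
    return acts


def gradient_times_input(weights: Sequence[Matrix], x: Sequence[float]) -> List[float]:
    acts = forward(weights, x)
    grad = [1.0]
    for W, out in zip(reversed(weights), reversed(acts[1:])):
        gated = [g if a > 0.0 else 0.0 for g, a in zip(grad, out)]
        grad = [sum(W[k][j] * gated[k] for k in range(len(W)))
                for j in range(len(W[0]))]
    return [g * xi for g, xi in zip(grad, x)]


def lrp0(weights: Sequence[Matrix], x: Sequence[float]) -> List[float]:
    """Redistribute the output score layer by layer with the LRP-0 rule."""
    acts = forward(weights, x)
    relevance = list(acts[-1])  # start from the output score
    for W, inp in zip(reversed(weights), reversed(acts[:-1])):
        lower = [0.0] * len(inp)
        for k, row in enumerate(W):
            z = sum(wkj * aj for wkj, aj in zip(row, inp))
            if z <= 0.0 or relevance[k] == 0.0:
                continue  # inactive neurons carry no relevance
            for j, (wkj, aj) in enumerate(zip(row, inp)):
                lower[j] += aj * wkj / z * relevance[k]
        relevance = lower
    return relevance


W1 = [[1.0, -2.0, 0.5], [-1.0, 1.0, 2.0], [0.5, 0.5, -1.0]]
W2 = [[1.0, 0.5, 2.0]]
x = [1.0, 0.5, 2.0]

scores_lrp = lrp0([W1, W2], x)
scores_gi = gradient_times_input([W1, W2], x)
```

On this example both procedures return the same input scores, and the scores add up to the network output, as expected from the conservation property of the propagation.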

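Proposition 4. For deep rectifier networks of the type $f(x) = \max(0, W_L \max(0, \ldots \max(0, W_1 x)))$, applying LRP-$\gamma$ is equivalent to performing one step of deep Taylor decomposition and choosing the nearest root point on the line $\{a - t \, a \odot (1 + \gamma \, 1_{w_k \succeq 0}) \, ; \, t \in \mathbb{R}\}$.

We choose the relevance model $\hat{R}_k(a) = \max(0, \sum_j a_j w_{jk}) \cdot c_k$ with $c_k$ constant (cf. [126] for a justification). Injecting the root point in the first-order terms of DTD (summands of Eq. (5)) gives:

$$R_{j \leftarrow k} = [\nabla \hat{R}_k(\tilde{a})]_j \cdot (a_j - \tilde{a}_j) = w_{jk} c_k \cdot t \, a_j \, (1 + \gamma \, 1_{w_{jk} \geq 0}) = t \, a_j \, (w_{jk} + \gamma w_{jk}^+) \, c_k$$

where $w_{jk}^+ = \max(0, w_{jk})$ and where $t$ is resolved using the conservation equation $\sum_j R_{j \leftarrow k} = R_k$. LRP-0 is a special case of LRP-$\gamma$ with $\gamma = 0$. A similar procedure with another choice of reference point gives LRP-$\epsilon$ (cf. [126]).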
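The effect of the LRP-$\gamma$ rule on a single neuron can be illustrated as follows. The sketch assumes the rule in the form commonly given in [126], where each input receives a share proportional to $a_j (w_{jk} + \gamma w_{jk}^+)$; the function name and the numerical values are hypothetical.

```python
from typing import List, Sequence


def lrp_gamma_messages(a: Sequence[float], w: Sequence[float],
                       relevance_k: float, gamma: float) -> List[float]:
    """Messages sent by one neuron k to its inputs under the LRP-gamma rule."""
    # Positive weights are boosted by gamma; gamma = 0 recovers LRP-0.
    rho = [wj + gamma * max(0.0, wj) for wj in w]
    z = sum(aj * rj for aj, rj in zip(a, rho))
    # Dividing by z enforces conservation: the messages sum to relevance_k.
    return [aj * rj / z * relevance_k for aj, rj in zip(a, rho)]


a = [1.0, 2.0, 1.0]    # input activations of the neuron
w = [2.0, -0.5, 1.0]   # incoming weights of the neuron
relevance_k = 2.0      # relevance already assigned to the neuron

messages_lrp0 = lrp_gamma_messages(a, w, relevance_k, gamma=0.0)
messages_gamma = lrp_gamma_messages(a, w, relevance_k, gamma=1.0)
```

With these values the messages are approximately [2, -1, 1] for $\gamma = 0$ and [1.6, -0.4, 0.8] for $\gamma = 1$. Both sum to the neuron's relevance, and the larger $\gamma$ shrinks the negative message, in line with the qualitative observation that LRP explanations contain less negative evidence.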

VI. EXTENDING EXPLANATIONS

The explanation methods we have presented in the previous sections were applied to a particular class of models (deep neural networks) and produced particular types of explanations (attribution on the input features). In the following, we present various extensions that broaden the applicability of these methods and diversify the types of explanation that can be produced. In particular, we will discuss (1) higher-order methods to produce richer explanations involving combinations of features, (2) a systematic way of extending explanation methods to non-neural-network models, e.g. in unsupervised learning where explanations are also needed, (3) a principled way to ensure that explanations of DNN classifiers are class-discriminative, and (4) strategies to go beyond individual explanations to arrive at a general understanding of the ML model.

A. Explaining Beyond Heatmaps

The locally linear structure of deep neural networks lends itself well to the heatmap-based methods we have reviewed in this paper, which we call first-order methods. However, special types of neural networks, e.g. those that incorporate products between input or latent variables, lose that property. Neural networks with product structures commonly occur for relational tasks such as comparing images [122] or collaborative filtering [66]. Graph neural networks [160], [91] multiply the input connectivity matrix multiple times, and consequently also exhibit product structures. In that case, the neural network is no longer piecewise linear and typically becomes piecewise polynomial in its input.

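For illustration, we present the BiLRP method [42], which assumes we have a similarity model built as a dot product on some hidden representation of a deep network:

$$y(x, x') = \big\langle \phi_L \circ \cdots \circ \phi_1(x), \; \phi_L \circ \cdots \circ \phi_1(x') \big\rangle$$

If all functions $\phi_1, \ldots, \phi_L$ are piecewise homogeneous linear, then $y$ can be rewritten as a composition: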

with piecewise bilinear. Using deep Taylor decomposition, but at each step applying a second-order Taylor expansion, we arrive conceptually at an attribution of the similarity score to pairs of input features. (Practically, the attribution can be expressed as a product of two branches of LRP computation, hence the name BiLRP.) An example of BiLRP explanation for the similarity of two planes in VGG-16 feature spa