1.1 Interpretability
The recent advance of machine learning methods is partly due to the widespread use of very complicated models, for instance deep neural networks. As an example, the Inception Network (Szegedy et al., 2015) depends on approximately 23 million parameters. While these models achieve and sometimes surpass humanlevel performance on certain tasks (image classification being one of the most famous), they are often perceived as black boxes, with little understanding of how they make individual predictions.
This lack of understanding is a problem for several
Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
reasons. First, it can be a source of catastrophic errors when these models are deployed in the wild. For instance, for any safety system recognizing cars in images, we want to be absolutely certain that the algorithm is using features related to cars, and not exploiting some artifacts of the images. Second, this opacity prevents these models from being socially accepted. It is important to get a basic understanding of the decision making process to accept it.
Model-agnostic explanation techniques aim to solve this interpretability problem by providing qualitative or quantitative help to understand how black-box algorithms make decisions. Since the global complexity of the black-box models is hard to understand, they often rely on a local point of view, and produce an interpretation for a specific instance. In this article, we focus on such an explanation technique: Local Interpretable Model-Agnostic Explanations (LIME, Ribeiro et al. (2016)).
Figure 1: LIME explanation for object identification in images. We used Inception (Szegedy et al., 2015) as a black-box model. Terrapin, a sort of turtle, is the top label predicted for the image in panel (a). Panel (b) shows the results of LIME, explaining how this prediction was made. The highlighted parts of the image are the superpixels with the top coefficients in the surrogate linear model. We ran the same experiment for the ‘strawberry’ label in panel (c).
1.2 Contributions
Our main goal in this paper is to provide theoretical guarantees for LIME. On the way, we shed light on some interesting behavior of the algorithm in a simple setting. Our analysis is based on the Euclidean version of LIME, called “tabular LIME.” Our main results are the following:
(i). When the model to explain is linear, we compute in closed-form the average coefficients of the surrogate linear model obtained by TabularLIME. (ii). In particular, these coefficients are proportional to the partial derivatives of the black-box model at the instance to explain. This implies that TabularLIME indeed highlights important features. (iii). On the negative side, using the closed-form expressions we show that it is possible to make some important features disappear in the interpretation, just by changing a parameter of the method. (iv). We also compute the local error of the surrogate model, and show that it is bounded away from 0 in general.
We explain how TabularLIME works in more details in Section 2. In Section 3, we state our main results. They are discussed in Section 4, and we provide an outline of the proof of our main result in Section 5. We conclude in Section 6.
2.1 Intuition
From now on, we will consider a particular model encoded as a function and a particular instance
to explain. We make no assumptions on this function, e.g., how it might have been learned. We simply consider f as a black-box model giving us predictions for all points of the input space. Our goal will be to explain the decision
that this model makes for one particular instance
As soon as f is too complicated, it is hopeless to try and fit an interpretable model globally, since the interpretable model will be too simple to capture all the complexity of f. Thus a reasonable course of action is to consider a local point of view, and to explain f in the neighborhood of some fixed instance . This is the main idea behind LIME: To explain a decision for some fixed input
, sample other examples around
, use these samples to build a simple interpretable model in the neighborhood of
, and use this surrogate model to explain the decision for
One additional idea that makes a huge difference with other existing methods is to use discretized features of smaller dimension to build the local model. These new categorical features are easier to interpret, since they are categorical. In the case of images, they are built by using a split of the image
into superpixels (Ren and Malik, 2003). See Figure 1 for an example of LIME output in the case of image classification. In this situation, the surrogate model highlights the superpixels of the image that are the most “active” in predicting a given label.
Whereas LIME is most famous for its results on images, it is easier to understand how it operates and to analyze theoretically on tabular data. In the case of tabular data, LIME works essentially in the same way, with a main difference: tabular LIME requires a train set, and each feature is discretized according to the empirical quantiles of this training set.
Figure 2: General setting of TabularLIME along coordinate j. Given a specific datapoint (in red), we want to build a local model for f (in blue), given new samples
(in black). Discretizing with respect to the quantiles of the distribution (in green), these new samples are transformed into categorical features
purple). In the construction of the surrogate model, they are weighted with respect to their proximity with
(here exponential weights given by Eq. (2.1), in black). In red, we plotted the tangent line, the best linear approximation one could hope for.
We now describe the general operation of LIME on Euclidean data, which we call TabularLIME. We provide synthetic description of TabularLIME in Algorithm 1, and we refer to Figure 2 for a depiction of our setting along a given coordinate. Suppose that we want to explain the prediction of the model f at the instance TabularLIME has an intricate way to sample points in a local neighborhood of
constructs empirical quantiles of the train set on each dimension, for a given number p of bins. These quantile boxes are then used to construct a discrete representation of the data: if
falls between ˆ
and ˆ
, it receives the value k. We now have a discrete version of
, say
. The next step is to sample discrete examples in
uniformly at random: for instance,
means that TabularLIME sampled an encoding such that the first coordinate falls into the first quantile box, the second coordinate into the third, etc. TabularLIME subsequently un-discretizes these encodings by sampling from a normal distribution truncated to the corresponding quantile boxes, obtaining new examples
. For example, for sample
we now sample the first coordinate from a normal distribution restricted to quantile box #1, the second coordinate from quantile box #3, etc. This sampling procedure ensures that we have samples in each part of the space. The next step is to convert these sampled points to binary features, indicating for each coordinate if the new example falls into the same quantile box as
. Here,
would be
. Finally, an interpretable model (say linear) is learned using these binary features.
2.2 Implementation choices and notation
LIME is a quite general framework and leaves some freedom to the user regarding each brick of the algorithm. We now discuss each step of TabularLIME in more detail, presenting our implementation choices and introducing our notation on the way.
Discretization. As said previously, the first step of TabularLIME is to create a partition of the input space using a train set. Intuitively, TabularLIME produces interpretable features by discretizing each dimension. Formally, given a fixed number of bins p, for each feature j, the empirical quantiles ˆputed. Thus, along each dimension, there is a mapping ˆ
associating each real number to the index of the quantile box it belongs to. For any point
, the interpretable features are then defined as a 0
1 vector corresponding to the discretization of x being the same as the discretization of
for all 1
. Intuitively, these categorical features correspond to the absence or presence of interpretable components. The discretization process makes a huge difference with respect to other methods: we lose the obvious link with the gradient of the function, and it is much more complicated to see how the local properties of f influence the result of the LIME algorithm, even in a simple setting. In all our experiments, we took
4 (quartile discretization, the default setting).
Figure 3: A visualization of the train set in dimension 2 with
, and
1. The empirical quantiles (dashed green lines) are already very close to the theoretical quantiles (green lines) for
500. The main difference in the procedure appears if
(red cross) is chosen at the edge of a quantile box, changing the way all the new samples are encoded. But for a train set containing enough observations and a generic
, there is virtually no difference between using the theoretical quantiles and the empirical quantiles.
Sampling strategy. Along with ˆcreates an un-discretization procedure ˆ
R. Simply put, given a coordinate j and a bin index k, ˆ
samples a truncated Gaussian on the corresponding bin, with parameters computed from the training set. The TabularLIME sampling strategy for a new example amounts to (i) sample
variable such that the
are independent samples of the discrete uniform distribution on t1, . . . , pu, and (ii) apply the un-discretization step, that is, return ˆ
. We will denote by
these new examples, and
their discretized counterparts. Note that it is possible to take other bin boxes than those given by the empirical quantiles, the
sampled according to the frequency observed in the dataset. The sampling step of TabularLIME helps to explore the values of the function in the neighborhood of the instance to explain. Thus it is not so important to sample according to the distribution of the data,
and a Gaussian sampling that mimics it is enough.
Assuming that we know the distribution of the train data, it is possible to use the theoretical quantiles instead of the empirical ones. For a large number of examples, they are arbitrary close (see, for instance, Lemma 21.2 in Van der Vaart (2000)). See Figure 3 for an illustration. It is this approach that we will take from now on: we denote the discretization step by and denote the quantiles by
for 1
and 0
to mark this slight difference. Also note that, for every 1
, we set
the quantiles bounding
(see Figure 2).
Train set. TabularLIME requires a train set, which is left free to the user. In spirit, one should sample according to the distribution of the train set used to fit the model f. Nevertheless, this train set is rarely available, and from now on, we choose to consider draws from a . The parameters of this Gaussian can be estimated from the training data that was used for f if available. Thus, in our setting, along each dimension
are the (rescaled) quantiles of the normal distribution. In particular, they are identical for all features. A fundamental consequence is that sampling the new examples
s first and then discretizing has the same distribution as sampling first the bin indices
s and then un-discretizing.
Weights. We choose to give each example the weight
where is the Euclidean norm on
bandwidth parameter. It should be clear that
is a hard parameter to tune:
– if is very large, then all the examples receive positive weights: we are trying to build a simple model that captures the complexity of f at a global scale. This cannot work if f is too complicated.
– if is too small, then only examples in the immediate neighborhood of
receive positive weights. Given the discretization step, this amounts to choosing
for all i. Thus the linear model built on top would just be a constant fit, missing all the relevant information.
Note that other distances than the Euclidean distance can be used, for instance the cosine distance for text data. The default implementation of LIME uses instead of
, with bandwidth set to 0.75d. We choose to use the true Euclidean distance between
and the new examples as it can be seen as a smoothed version of the distance to
and has the same behavior.
Interpretable model. The final step in TabularLIME is to build a local interpretable model. Given a class of simple, interpretable models G, TabularLIME selects the best of these models by solving
where is a local loss function evaluated on the new examples
, and Ω :
is a regularizer function. For instance, a natural choice for the local loss function is the weighted squared loss
We saw in Section 1.1 different possibilities for G. In this paper, we will focus exclusively on the linear models, in our opinion the easiest models to interpret. Namely, we set , with
and
. To get rid of the intercept
, we now use the standard approach to introduce a phantom coordinate 0, and
with
1 and
. We also stack the
s together to obtain
The regularization term Ωpgq is added to insure further interpretability of the model by reducing the number of non-zero coefficients in the linear model given by TabularLIME. Typically, one uses regularization (ridge regression is the default setting of LIME) or
regularization (the Lasso). To simplify the analysis, we will set Ω 0 in the following. We believe that many of the results of Section 3 stay true in a regularized setting, especially the switch-off phenomenon that we are going to describe below: coefficients are even more likely to be set to zero when Ω
In other words, in our case TabularLIME performs weighted linear regression on the interpretable features s, and outputs a vector
Note that is a random quantity, with randomness coming from the sampling of the new examples
. It is clear that from a theoretical point of view, a big hurdle for the theoretical analysis is the discretization process (going from the
Regression vs. classification. To conclude, let us note that TabularLIME can be used both for regression and classification. Here we focus on the regression mode: the outputs of the model are real numbers, and not discrete elements. In some sense, this is a more general setting than the classification case, since the classification mode operates as TabularLIME for regression, but with f chosen as the function that gives the likelihood of belonging to a certain class according to the model.
2.3 Related work
Let us mention a few other model-agnostic methods that share some characteristics with LIME. We refer to Guidotti et al. (2019) for a thorough review.
Shapley values. Following Shapley (1953) the idea is to estimate for each subset of features S the expected prediction difference ∆pSq when the value of these features are fixed to those of the example to explain. The contribution of the jth feature is then set to an average of the contribution of j over all possible coalitions (subgroups of features not containing j). They are used in some recent interpretability work, see Lundberg and Lee (2017) for instance. It is extremely costly to compute, and does not provide much information as soon as the number of features is high. Shapley values share with LIME the idea of quantifying how much a feature contributes to the prediction for a given example.
Gradient methods. Also related to LIME, gradient-based methods as in Baehrens et al. (2010) provide local explanations without knowledge of the model. Essentially, these methods compute the partial derivatives of f at a given example. For images, this can yield satisfying plots where, for instance, the contours of the object appear: a saliency map (Zeiler and Fergus, 2014). Shrikumar et al. (2016, 2017) propose to use the “input derivative” product, showing advantages over gradient methods. But in any case, the output of these gradient based methods is not so interpretable since the number of features is so high. LIME gets around this problem by using a local dictionary with much smaller dimensionality than the input space.
We are now ready to state our main result. Let us denote by the coefficients of the linear surrogate model obtained by TabularLIME. In a nutshell, when the underlying model f is linear, we can derive the average value
coefficients. In particular, we will see that the
s are proportional to the partial derivatives
. The exact form of the proportionality coefficients is given in the formal statement below, it essentially depends on the scaling parameters
and the s, the quantiles left and right of the
Theorem 3.1 (Coefficients of the surrogate model, theoretical values). Assume that f is of the form
where, for any 1 , we defined
and
Let . Then, with high probability greater than 1
, it holds that
A precise statement with the accurate dependencies in the dimension and the constants hidden in the result can be found in the Appendix (Theorem 10.1). Before discussing the consequences of Theorem 3.1 in the next section, remark that since is encoded by
the prediction of the local model at
, is just the sum of the
s. According to Theorem 3.1, ˆ
will be close to this value, with high probability. Thus we also have a statement about the error made by the surrogate model in
Corollary 3.1 (Local error of the surrogate model). Let . Then, under the assumptions of Theorem 3.1, with probability greater than 1
, it holds that
with hidden constants depending on
Obviously the goal of TabularLIME is not to produce a very accurate model, but to provide interpretability. The error of the local model can be seen as a hint about how reliable the interpretation might be.
We now discuss the consequences of Theorem 3.1 and Corollary 3.1.
1 2 3 4 5 6 7 8 9 10 features id
Coefficients of the surrogate model for simple f
Figure 4: Example where the true underlying black box model only depends on two features: 10
10
. For each of the 10 features, we plot the values of the
s obtained by TabularLIME. The blue line shows the median over all experiments, the red cross the
theoretical value according to our theorem. The boxplots contain values between first and third quartiles, the whiskers are 1.5 times the interquartile ranges, and the black dots mark values outside this range. To produce the figure, we made 20 repetitions of the experiment, with
examples and
see that TabularLIME finds nonzero coefficients exactly for the first two coordinates, up to noise coming from the sampling. This is the result that one would hope to achieve, and also the result predicted by our theory.
Dependency in the partial derivatives. A first consequence of Theorem 3.1 is that the coefficients of the linear model given by TabularLIME are approximately proportional to the partial derivatives of , with constant depending on our assumptions. An interesting follow-up is that, if f depends only on a few features, then the partial derivatives in the other coordinates are zero, and the coefficients given by TabularLIME for these coordinates will be 0 as well. For instance, if
10
10
as in Figure 4, then
11
1, and
0 for all
3. In a simple setting, we thus showed that TabularLIME does not produce interpretations with additional erroneous feature dependencies. Indeed, when the number of samples is high, the coordinates which do not influence the prediction will have a coefficient close to the theoretical value 0 in the surrogate linear model. For a bandwidth not too large, this dependency in the partial derivatives seems to hold to some extent for more general functions. See for instance Figure 6, where we demonstrate this phenomenon for a kernel regressor.
Robustness of the explanations. Theorem 3.1 means that, for large n, TabularLIME outputs coeffi-cients that are very close to with high probability, where
is a vector that can be computed explicitly as
Coeff c ents of the surrogate model for a l near model learned on Boston Hous ng
Figure 5: Values of the coefficients obtained by TabularLIME on each coordinate in dimension 13 for a linear model trained on the Boston housing dataset (Harrison Jr. and Rubinfeld, 1978). The
s are concentrated around the red crosses, which denote the
the theoretical values predicted by Theorem 3.1. To produce the figure, we ran 20 experiments with
new samples generated for each run and we set
per Eq. (3.1). Still without looking too closely at the values of , this is already interesting and hints that there is some robustness in the interpretations provided by TabularLIME: given enough samples, the explanation will not jump from one feature to the other. This is a desirable property for any interpretable method, since the user does not want explanations to change randomly with different runs of the algorithm. We illustrate this phenomenon in Figure 5.
Influence of the bandwidth. Unfortunately, Theorem 3.1 does not provide directly a founded way to pick , which would for instance minimize the variance for a given level of noise. The quest for a founded heuristic is still open. However, we gain some interesting insights on the role of
. Namely, for fixed
,
, the multiplicative constants
appearing in Eq. (3.1) depend essentially on
Without looking too much into these constants, one can already see that they regulate the magnitude of the coefficients of the surrogate model in a non-trivial way. For instance, in the experiment depicted in Figure 4, the partial derivative of f along the two first coordinate has the same magnitude, whereas the interpretable coefficient is much larger for the first coordinate than the second. Thus we believe that the value of the coefficients in the obtained linear model should not be taken too much into account.
More disturbing, it is possible to artificially (or by
Coefficients of the surrogate model for a ernel regressor learned on Boston Housing
Figure 6: Values of the coefficients obtained by TabularLIME on each coordinate. We used the same settings as in Figure 5, but this time we train a kernel ridge regressor on the Boston Housing dataset—a nonlinear function. For the ridge regression, we used the Gaussian kernel with scale parameter set to 5 and default regularization constant (1). We then estimated the partial derivatives of f at
and reported the corresponding
s in red. For the chosen bandwidth (we took
1), the experiments seem to roughly agree with our theory.
accident) put to zero, therefore forgetting about feature j in the explanation, whereas it could play an important role in the prediction. To see why, we have to return to the definition of the
s: since
by construction, to have
0 is possible only if
and is set to
. We demonstrate this switching-off phenomenon in Figure 7. An interesting take is that
not only decides at which scale the explanation is made, but also the magnitude of the coefficients in the interpretable model, even for small changes of
Error of the surrogate model. A simple consequence of Corollary 3.1 is that, unless some cancellation happens between in the term error of the surrogate model is bounded away from zero. For instance, as soon as ˜
, it is the general situation. Therefore, the surrogate model produced by TabularLIME is not accurate in general. We show some experimental results in Figure 8.
Finally, we discuss briefly the limitations of Theorem 3.1.
1 2 3 4 5 6 7 8 9 10 features id
Coefficients of the surrogate model for simple f
Figure 7: Values of the coefficients given by LIME. In this experiment, we took exactly the same setting as in Figure 4, but this time set the bandwidth to 53 instead of 1. In that case, the second feature is switched-off by TabularLIME. Note that it is not the case that
is too small and that we are in a degenerated case: TabularLIME still puts a nonzero coefficient on the first coordinate.
Linearity of f. The linearity of f is a quite restrictive assumption, but we think that it is useful to consider for two reasons.
First, the weighted nature of the procedure means that TabularLIME is not considering examples that are too far away from with respect to the scaling parameter
Thus it is truly a local assumption on f, that could be replaced by a boundedness assumption on the Hessian of f in the neighborhood of
, at the price of more technicalities and assuming that
is not too large. See, in particular, Lemma 11.3 in the Appendix, after which we discuss an extension of the proof when f is linear with a second degree perturbative term. We show in Figure 6 how our theoretical predictions behave for a non-linear function (a kernel ridge regressor).
Second, our main concern is to know whether TabularLIME operates correctly in a simple setting, and not to provide bounds for the most general f possible. Indeed, if we can already show imperfect behavior for TabularLIME when f is linear as seen earlier, our guess is that such behavior will only worsen for more complicated f.
Sampling strategy. In our derivation, we use the theoretical quantiles of the Gaussian distribution along each axis, and not prescribed quantiles. We believe that the proof could eventually be adapted, but that the result would loose in clarity. Indeed, the computations for a truncated Gaussian distribution are far more convoluted than for a Gaussian distribution. For instance, in the proof of Lemma 8.1 in the Appendix, some complicated quantities depending on the prescribed
Figure 8: Histogram of the errors ˆ. The setting is the same as in Figure 4, but we repeated the experiment 100 times. The red double arrow marks the value given by Corollary 3.1 around which the local error concentrate. With high probability, the error of the surrogate model is bounded away from 0.
quantiles would appear when computing
In this section, we explain how Theorem 3.1 is obtained. All formal statements and proofs are in the Appendix.
Outline. The main idea underlying the proof is to realize that is the solution of a weighted least squares problem. Denote by Π
the diagonal matrix such that Π
(the weight matrix), and set
the response vector. Then, taking the gradient of Eq. (5.1), one obtains the key equation
Let us define and
, as well as their population counterparts Σ :
and Γ :
. Intuitively, if we can show that
close to Σ and Γ, assuming that Σ is invertible, then we can show that
is close to
The main difficulties in the proof come from the non-linear nature of the new features , introducing tractable but challenging integrals. Fortunately, the Gaussian sampling of LIME allows us to overcome these challenges (at the price of heavy computations).
Covariance matrix. The first part of our analysis is thus concerned with the study of the empirical covariance matrix pΣ. Perhaps surprisingly, it is possible
to compute the population version of
where the s were defined in Section 3, and
is a scaling constant that does not appear in the final result (see Lemma 8.1).
Since the s are always distinct from 0 and 1, the special structure of Σ makes it possible to invert it in closed-form. We show in Lemma 8.2 that
We then achieve control ofconcentration inequalities, since the new samples are Gaussian and the binary features are bounded (see Proposition 8.1).
Right-hand side of Eq. (5.1). Again, despite the non-linear nature of the new features, it is possible to compute the expected version of in our setting. In this case, we show in Lemma 9.1 that
where the s were defined in Section 3. They play an analogous role to the
s but, as noted before, they are signed quantities. As with the analysis of the covariance matrix, since the weights and the new features are bounded, it is possible to show a concentration result for pΓ (see Lemma 9.3).
Concluding the proof. We can now conclude, first upper bounding
and then controlling each of these terms using the previous concentration results. The expression of simply obtained by multiplying Σ
In this paper we provide the first theoretical analysis of LIME, with some good news (LIME discovers interesting features) and bad news (LIME might forget some important features and the surrogate model is not faithful). All our theoretical results are verified by simulations.
For future work, we would like to complement these results in various directions: Our main goal is to extend the current proof to any function by replacing f by its Taylor expansion at . On a more technical side, we would like to extend our proof to other distance functions (e.g., distances between the
, which is the default setting of LIME), to non-isotropic sampling of the
s (that is,
not constant across the dimensions), and to ridge regression.
Acknowledgements
The authors would like to thank Christophe Biernacki for getting them interested in the topic, as well as Leena Chennuru Vankadara for her careful proofreading. This work has been supported by the German Research Foundation through the Institutional Strategy of the University of Tübingen (DFG, ZUK 63), the Cluster of Excellence “Machine Learning—New Perspectives for Science” (EXC 2064/1 number 390727645), and the BMBF Tuebingen AI Center (FKZ: 01IS18039A).
D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(6):1803–1831, 2010.
S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Gi- annotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5):93, 2019.
D. Harrison Jr. and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81– 102, 1978.
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017.
D. P. O’Leary and G. W. Stewart. Computing the eigenvalues and eigenvectors of arrowhead matrices. Journal of Computational Physics, 90:497–505, 1996.
X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, 2003.
M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you? explaining the predictions of any classifier. In SIGKDD, 2016.
L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In ICML, 2017.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
M. J. Wainwright. High-dimensional statistics: a nonasymptotic viewpoint. Cambridge University Press, 2019.
H. Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
M. D. Zeiler and R. Fergus. Visualizing and under- standing convolutional networks. In ECCV, pages 818–833, 2014.
Explaining the Explainer: A First Theoretical Analysis of LIME
In this supplementary material, we provide the proof of Theorem 3.1 of the main paper. It is a simplified version of Theorem 10.1. We first recall our setting in Section 7. Then, following Section 5 of the main paper, we study the covariance matrix in Section 8, and the right-hand side of the key equation (5.1) in Section 9. Finally, we state and prove Theorem 10.1 in Section 10. Some technical results (mainly Gaussian integrals computation) and external concentration results are collected in Section 11.
Let us recall briefly the main assumptions under which we prove Theorem 3.1. Recall that they are discussed in details in Section 2.2 of the main paper.
H1 (Linear f). The black-box model can be written
H2 (Gaussian sampling). The random variables are i.i.d.
Also recall that, for any 1 , we set the weights to
We will need the following scaling constant:
which does not play any role in the final result. One can check that 1 when
, regardless of the dimension.
Finally, for any 1 , recall that we defined
and
where are the quantile boundaries of
. These coefficients are discussed in Section 5 of the main paper. Note that all the expected values are taken with respect to the randomness on the
In this section, we state and prove the intermediate results used to control the covariance matrix . The goal of this section is to obtain the control of
in probability. Intuitively, if this quantity is small enough,
then we can inverse Eq. (5.1) and make very precise statements about
We first show that it is possible to compute the expected covariance matrix in closed form. Without this result, a concentration result would still hold, but it would be much harder to gain precise insights on the
Lemma 8.1 (Expected covariance matrix). Under Assumption 2, the expected value of is given by
Proof. Elementary computations yield
Reading the coefficients of this matrix, we have essentially three computations to complete:
Computation of Since the
s are Gaussian (Assumption 2) and using the definition of the weights (Eq. (7.1)), we can write
By independence across coordinates, the last display amounts to
We then apply Lemma 11.1 to each of the integrals within the product to obtain
We recognize the definition of the scaling constant (Eq. (7.2)): we have proved that
Computation of s are Gaussian (Assumption 2) and using the definition of the weights (Eq. (7.1)),
By independence across coordinates, the last display amounts to
Using Lemma 11.1, we obtain
We recognize the definition of the scaling constant (Eq. (7.2)) and of the coefficient (Eq. (7.3)): we have proved that
Computation of Since the
s are Gaussian (Assumption 2) and using the definition of the weights (Eq. (7.1)),
By independence across coordinates, the last display amounts to
Using Lemma 11.1, we obtain
We recognize the definition of the scaling constant (Eq. (7.2)) and of the alphas (Eq. (7.3)): we have proved that
As it turns out, we show that it is possible to invert Σ in closed-form, therefore simplifying tremendously our quest for control of. Indeed, in most cases, even if concentration could be shown, one would not have a precise idea of the coefficients of Σ
Lemma 8.2 (Inverse of the covariance matrix). If is invertible, and
Proof. Define the vector of the
Then Σ is a block matrix that can be written Σ
Note that, since erf is an increasing function, the s are always distinct from 0 and 1. Thus
invertible matrix, and we can use the block matrix inversion formula to obtain the claimed result.
As a direct consequence of the computation of Σ, we can control its largest eigenvalue.
Lemma 8.3 (Control ofWe have the following bound on the operator norm of the inverse covariance
where
where we used in the last step of the derivation.
Remark 8.1. Better bounds can without doubt be obtained. A step in this direction is to notice that is an arrowhead matrix (O’Leary and Stewart, 1996). Thus the eigenvalues of S are solutions of the secular equation
Further study of this equation could yield an improved statement for Lemma 8.3.
We now show that the empirical covariance matrix concentrates around Σ. It is interesting to see that the non-linear nature of the new coordinates (the s) calls for complicated computations but allows us to use simple concentration tools since they are, in essence, Bernoulli random variables.
Proof. Recall that : it suffices to show the result for the Frobenius norm. Next, we notice that the summands appearing in the entries of
, and
, are all bounded. Indeed, by the definition of the weights and the definition of the new features, they all take values in r0, 1s. Moreover, for given
, they are independent random variables. Thus we can apply Hoeffding’s inequality
We conclude by a union bound on the entries of the matrix.
As a consequence of the two preceding lemmas, we can control the largest eigenvalue of Σ
Lemma 8.5 (Control of
Proof. Let . According to Lemma 8.3,
. We deduce that
Now let us use Lemma 8.4 with this t: there is an event Ω, which has probability greater than 1 such that
. According to Weyl’s inequality (Weyl, 1912), on this event,
In particular,
We can now state and prove the main result of this section, controlling the operator norm of Σ with high probability.
Proposition 8.1 (Control of
Remark 8.2. Proposition 8.1 is the key tool to invert Eq. (5.1) and gain precise control over . In the regime that we consider, the dimension d as well as the number of bins p are fixed, and
, and
are essentially numerical constants. We did not optimize these constant with respect to d, since the main message is to consider the behavior for a large number of new examples (
Proof. We notice that, assuming that is invertible,
is sub-multiplicative, we just have to control each term individually. Lemma 8.3 gives us
Finally, set . It is easy to check that
. Thus we can use Lemma 8.5: with probability greater than 1
By the union bound, with probability greater than 1
In this section, we state and prove the results in relation to . We begin with the computation of Γ, the expected value of
Lemma 9.1 (Computation of Γ). Under Assumption 2 and 1, the expected value of
is given by
where the s are defined by
Proof. Given the expression of , we have essentially two computations to manage:
Computation of Under Assumption 1, by linearity of the integral,
Now we have already seen in the proof of Lemma 8.1 that . Thus we can focus on the computation of
. Under Assumption 2, we have
A straightforward application of Lemmas 11.1 and 11.2 yields
Back to Eq. (9.1), we have shown that
Computation of Under Assumption 1, by linearity of the integral,
We have already computed in the proof of Lemma 8.1 and found that
Regarding the computation of , there are essentially two cases to consider depending whether
not. Let us first consider the case
. Then we obtain
According to Lemma 11.2 and the definition of (Eqs. (7.3) and (7.3)), we have
Now if , by independence,
splits in three parts:
In definitive, plugging these results into Eq. (9.2) gives
As a consequence of Lemma 9.1, we can control
Lemma 9.2 (Control of Under Assumptions 2 and 1, it holds that
Proof. According to Lemma 9.1, we have
Successively using
which concludes the proof.
Proof. Since the are Gaussian with variance
(Assumption 2), the random variable
is Gaussian with variance
, and the
are sub-Gaussian with parameter
. They are also independent, thus we can apply Theorem 11.2 to the
Furthermore, the -valued. Thus the random variables
are also sub-Gaussian with parameter
. We use Hoeffding’s inequality (Theorem 11.2) again, to obtain, for any j,
By the union bound,
We deduce the result since
In this section, we state and prove our main result, Theorem 10.1. It is a more precise version than Theorem 3.1 in the main paper.
Theorem 10.1 (Concentration of
with probability greater than 1
and then to control these two terms using the results of Section 8 and 9.
Control ofWe use the upper bound
. We then achieve control of the operator norm of the empirical covariance matrix in probability with Lemma 8.5, and control of the norm of
Γ in probability with Lemma 9.3. Set
According to Lemma 8.5, for any , there is an event Ω
which has probability greater than 1
such that
According to Lemma 9.3, for any , there exists an event Ω
which has probability greater than 1
4
on that event. One can check that
by definition of
Control ofWe use the upper bound
. We then achieve control of
in probability with Proposition 8.1, whereas we can bound the norm of Γ almost surely with Lemma 9.2. If
0, then there is nothing to prove. Otherwise, set
According to Proposition 8.1, for any , there is an event Ω
which has probability greater than 1
on this event. With the help of Lemma 9.2, one can check that
Conclusion. Set . Define Ω
, where the Ω
are defined as before.
According to the previous reasoning, on the event Ω
Moreover, the union bound gives . We conclude by noticing that
is always smaller than
, thus we just have to require
, as in the statement of our result.
11.1 Gaussian integrals
In this section, we collect some Gaussian integral computations that are needed in our derivations. We provide succinct proof, since essentially any modern computer algebra system will provide these formulas. Our first result is for zero-th order Gaussian integral.
Lemma 11.1 (Gaussian integral, 0-th order). Let be real numbers, and
be positive real numbers. Then, it holds that
Remark 11.1. We often replace by the more readable
in the main text of the paper.
Since f is assumed to be linear in most of the paper, we need first order computations as well:
Lemma 11.2 (Gaussian integral, 1st order). Let be real numbers, and
be positive numbers. Then it
holds that
Finally we want to mention the following result.
Lemma 11.3 (Gaussian integral, 2nd order). Let be real numbers, and
be positive real numbers.
Then, it holds that
Remark 11.2. As a consequence of Lemma 11.3, it would be possible to further our analysis by adding second degree terms to f. Indeed, quantities depending on , which would have to be computed to extend the proofs of Lemmas 9.1 and 9.3, can be computed with this lemma. For instance, one can show that
11.2 Concentration results
In this section we collect some concentration results used throughout our proofs. Note that we use rather use the two-sided version of these results.
Theorem 11.1 (Hoeffding’s inequality). Let be independent random variables such that
its values in
almost surely for all
. Then for every
Proof. This is Theorem 2.8 in Boucheron et al. (2013) in our notation.
Theorem 11.2 (Hoeffding’s inequality for sub-Gaussian random variables). Let be independent random variables such that
is sub-Gaussian with parameter
. Then, for every
Proof. This is Proposition 2.1 in Wainwright (2019).