Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models

2020·Arxiv

ABSTRACT

We are witnessing a transformation of business processes towards a more computation-driven approach. The ever-increasing use of Machine Learning techniques is the clearest example of this trend. This revolution often brings advantages, such as higher prediction accuracy and a shorter time to obtain results. However, these methods present a major drawback: it is very difficult to understand on what grounds the algorithm took its decision. To address this issue we consider the LIME method. We give a general background on LIME, then we focus on the stability issue: applying the method repeatedly, under the same conditions, may yield different explanations. Two complementary indices are proposed to measure LIME stability. It is important for the practitioner to be aware of the issue, as well as to have a tool for spotting it. Stability is a prerequisite for LIME explanations to be reliable; therefore a stability assessment, made through the proposed indices, is crucial. As a case study, we apply both Machine Learning and classical statistical techniques to Credit Risk data. We test LIME on the Machine Learning algorithm and check its stability. Finally, we examine the quality of the explanations returned.

1. Introduction

More and more interest is being devoted to the concept of “learning from the data”, i.e. using the data collected about a process to predict its outcome (Hastie, Tibshirani, & Friedman, 2009). The main ingredients of its recent success are the wide availability of data sources and the increased computational power, which allows complex algorithms to deliver results in a relatively short time.

In statistics, making predictions about the future is a particularly relevant topic. To address the subject, simple algorithms and methods have been developed over the years, the most famous being Linear Regression and Generalised Linear Models (Greene, 2003). However, with the advent of powerful computing tools, more sophisticated techniques have been developed. In particular, Machine Learning models are able to perform intelligent tasks usually done by humans, supporting the automation of data driven processes.

Despite the enhanced accuracy, Machine Learning models display weakness especially when it comes to interpretability, i.e. “the ability to explain or to present the results, in understandable terms, to a human” (Hall & Gill, 2018). They usually adopt large model structures and refine the prediction using a huge number of iterations. The logic underlying the model ends up hidden under potentially many strata of mathematical calculations, as well as scattered across a too vast architecture, preventing humans from grasping it.

To achieve interpretability, quite a few techniques have been proposed in recent literature. These approaches can be grouped according to different criteria (Molnar, 2020a; Guidotti et al., 2018), such as i) model-agnostic or model-specific, ii) local, global or example-based, iii) intrinsic or post-hoc, iv) perturbation- or saliency-based. Herein, we focus on LIME (Local Interpretable Model-agnostic Explanations), a local interpretability framework developed by Ribeiro, Singh, & Guestrin (2016).

The technique may suffer from a lack of stability, namely repeated applications of the method under the same conditions may yield different results. This is a particularly delicate issue; however, it is rarely taken into consideration. Even worse, the issue often goes unnoticed altogether, e.g. when just a single call to the method is made and the result is accepted without further checks.

In this paper, we introduce a pair of complementary stability indices, useful to measure LIME stability and spot potential issues. They represent an innovative contribution to the scientific community, addressing an important research question.

The indices are calculated on repeated calls of the method, to evaluate the similarity of the results. They may be applied to any trained LIME method and allow the practitioner either to become aware of potential instability in the results or to confirm that the trained method is consistent.

Hereafter, a brief introduction to explainability techniques is presented in Chapter 2. The LIME technique is analysed in detail in Chapter 3, including its weak points. A thorough discussion of LIME stability can be found in Chapter 4, along with a description of some recent works tackling the issue. Our proposal is discussed extensively in Chapter 5. A practical application of the method in the Credit Risk Modelling field is shown in Chapter 6. Chapter 7 is dedicated to Discussion and Conclusions.

2. Related Work

Explainability methods are grouped into Global and Local Explainability techniques (Guidotti et al., 2018). Global methods aim to give an understanding of the model as a whole: the explanation should apply to all the records in the dataset. Local methods, instead, attempt to provide a very good understanding of just a small portion of records.

In the following review we consider both global and local techniques developed in a model agnostic fashion, so as to be effective on any kind of ML model by construction.

A popular approach is to exclude a certain feature, or group of features, from the model and evaluate the loss incurred in terms of model goodness. The idea was first introduced by Breiman (2001) for the Random Forest model and has been generalised to a model-agnostic framework, named LOCO (Lei, G’Sell, Rinaldo, Tibshirani, & Wasserman, 2018). Based on variable exclusion, the predictive power of ML models has been decomposed into single-variable contributions in PDP (Friedman, 2001), ICE (Goldstein, Kapelner, Bleich, & Pitkin, 2015) and ALE (Apley & Zhu, 2016) plots, based on different assumptions about the ML model. The same idea is also exploited for local explanations in SHAP (Lundberg & Lee, 2017), where the decomposition is obtained through a game-theoretic setting. The goal of these methods is a fair measure of feature importance. They usually suffer from correlation among features, since it introduces distortion in the results, while the changes proposed to tackle the correlation have stripped the techniques of some theoretical properties.

Another common approach is to train a surrogate model mimicking the behaviour of the ML model. In this vein, approximations on the entire input space are provided in Craven & Shavlik (1996) and Zhou & Hooker (2016), among others, while LIME (Ribeiro et al., 2016) and its extension using decision rules (Ribeiro, Singh, & Guestrin, 2018) rely on this technique to provide local approximations. Surrogate models have the advantage of providing a prediction model, which allows some sort of what-if analysis (e.g. if I were to earn €5000 more a year, how many points would I gain on my credit score?) that is not possible for feature attribution methods. However, one should pay attention to their limitations: global techniques are usually a coarse estimate of the ML model, so the what-if analysis can be quite approximate; local methods, on the contrary, provide a good approximation but only for a small region of the input variables, meaning that the scenario we test should comprise only small changes.

3. LIME

LIME (Ribeiro et al., 2016) is a method for explaining black-box models, i.e. models whose inner logic is hidden and not clearly understandable.

It provides a number of explainable models which closely resemble the original model behaviour. Each model is specific for an input point x: only in its neighbourhood the explainable model’s predictions are guaranteed to be very close to the black-box ones. This peculiarity places LIME among the Local Explainability tools.

In the following we will focus specifically on LIME for tabular data, since they represent the vast majority of data sources in the Credit Scoring field.

3.1. General Idea

LIME aims to approximate the black-box model f with a simple function g around the point of interest x. g is required to lie in the class of explainable models G:

$$f: \mathbb{R}^P \to \mathbb{R}, \qquad g: \mathbb{R}^p \to \mathbb{R}, \qquad g \in G,$$

where P is the number of features employed by the black-box model to make predictions about the response variable. The explainable model g uses only p of the original P variables, in order to reduce complexity.

Solving the following optimisation problem, we obtain the function g most similar to f in the neighbourhood of x:

$$\underset{g \in G}{\arg\min}\; \mathcal{L}(f, g, \pi_x) + \Omega(g),$$

where $\mathcal{L}$ measures how poorly g approximates f under the locality kernel $\pi_x$ centred at x, and $\Omega(g)$ penalises the complexity of g (Ribeiro et al., 2016).

Given an individual x, LIME returns a local explainable model g, which in turn provides the most important variables for predicting the points in the neighbourhood of x (see Figure 1).

Figure 1. LIME’s modus operandi. Courtesy of Ribeiro et al., 2016

3.2. LIME Algorithm in detail

LIME relies on producing new points, generated from a multivariate distribution of the features in the dataset. The features are considered to follow a Normal distribution, whose parameters are inferred from the dataset. For the purpose of data generation, each feature is assumed to be independent from the others. The points are generated all over the space of the dataset variables.

In order to account for locality, LIME weights each new point using a Gaussian Kernel. Its purpose is to assign a weight to each point, based on its distance from the individual to be explained.

The next step is to query the black-box model and obtain the predicted values for the new points. Doing so, we end up with a brand new dataset.

This dataset undergoes a rescaling process, which standardises each X feature, leaving the response variable untouched. This allows the contributions of the features to be compared on a similar scale, namely the standard deviation of each variable.

In order to obtain a human-understandable linear model, only a handful of features can be used. Therefore LIME also performs a feature selection step, usually done with the Lasso technique. This allows the explanations to be compact and human-readable. The number of variables to be retained, namely p, is decided by the practitioner.

On the standardised p-dimensional dataset, LIME performs Ridge Regression, i.e. a Linear Regression combined with a penalty on the $\ell_2$ norm of the coefficients (Hoerl & Kennard, 1970), used to prevent overfitting. The model training is done in a weighted fashion: each point contributes to the model according to its weight.

The result is a linear model, which provides understanding of the process through its coefficients: the larger the absolute value of a coefficient, the bigger the variation in the response variable when the feature is changed. The sign of the coefficient tells us the direction of the variation, namely whether we will face a decrease or an increase of the output value.
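The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the reference LIME implementation: the names `lime_sketch` and `weighted_ridge` are hypothetical, the kernel width is an assumed value, distances are computed on the standardised scale, and the Lasso selection step is replaced by a simpler stand-in that keeps the p features with the largest absolute coefficients in a full weighted Ridge fit.

```python
import numpy as np

def weighted_ridge(X, y, w, lam):
    """Weighted Ridge fit with an unpenalised intercept; returns [intercept, coefs...]."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0          # do not shrink the intercept
    AtW = A.T * w                # equals A' W for a diagonal weight matrix W
    return np.linalg.solve(AtW @ A + penalty, AtW @ y)

def lime_sketch(predict_fn, X_train, x, p, n_samples=5000, lam=1.0, seed=0):
    """Hypothetical sketch of tabular LIME for one individual x."""
    rng = np.random.default_rng(seed)
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    n_features = X_train.shape[1]
    # 1. Sample new points: independent Normal features, parameters from the data.
    Z = rng.normal(mu, sd, size=(n_samples, n_features))
    # 2. Standardise features, then weight each point with a Gaussian kernel
    #    on its distance from x (kernel width is an assumption).
    Zs, xs = (Z - mu) / sd, (x - mu) / sd
    width = 0.75 * np.sqrt(n_features)
    w = np.exp(-np.sum((Zs - xs) ** 2, axis=1) / width ** 2)
    # 3. Query the black box on the sampled points (response left unscaled).
    y = predict_fn(Z)
    # 4. Feature selection: stand-in for Lasso, keep the p largest |coefficients|.
    full = weighted_ridge(Zs, y, w, lam)
    keep = np.sort(np.argsort(-np.abs(full[1:]))[:p])
    # 5. Weighted Ridge on the p selected features.
    fit = weighted_ridge(Zs[:, keep], y, w, lam)
    return keep, fit[1:], fit[0]   # selected indices, coefficients, intercept
```

Calling `lime_sketch` several times with different `seed` values on the same `x` reproduces the setting in which stability becomes a concern: each call draws a different sample and may return different variables and coefficients.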

3.3. LIME Drawbacks

LIME is sensitive to the dataset dimensionality: when it is employed to interpret a Machine Learning model built using a very large number of variables, the local explanation is unable to discriminate between relevant and irrelevant features.

This phenomenon is due to the weighting kernel. Generally speaking, it can be considered as a similarity (or distance) function, thus it inherits the drawbacks of this class. As thoroughly described by Beyer, Goldstein, Ramakrishnan, & Shaft (1999), in high-dimensional datasets, given a fixed point, the distance to its nearest data point approaches the distance to the farthest one as dimensionality increases. LIME applies the kernel function before variable reduction; thus, for high-dimensional datasets, the kernel is not able to distinguish between near and distant points, considering all of them to be at approximately the same distance. This results in a loss of the locality concept and consequently in poor performance of the algorithm.
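The distance concentration effect can be observed with a small simulation. The sketch below is illustrative only: it assumes independent standard Normal features, and the helper name `nearest_to_farthest_ratio` is hypothetical. It reports the ratio between the nearest and the farthest distance from a fixed query point; a ratio close to 1 means that a distance-based kernel can hardly separate near points from distant ones.

```python
import numpy as np

def nearest_to_farthest_ratio(dim, n_points=2000, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(n_points, dim))   # simulated dataset
    query = rng.normal(size=dim)                # fixed point of interest
    dist = np.linalg.norm(points - query, axis=1)
    return dist.min() / dist.max()

for dim in (2, 20, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

Under these assumptions the ratio grows steadily with the number of dimensions, which is the behaviour that undermines the locality of the kernel weights.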

This occurrence is shown intuitively in Figure 2. There, the Credit Scoring dataset used in Section 6 has been employed to train a Gradient Boosting Tree model using 100 variables. Although Gradient Boosting has shown good performance in such a setting, LIME applied to the model has not been able to discriminate between important and irrelevant regressors. In particular, many features exhibit low values and almost all of them are equally important.

This weakness curbs LIME’s employment on black-box models handling high-dimensional datasets. To date, it is the practitioner’s duty to ensure the dataset dimensionality is low enough for LIME to work well. This usually requires feature selection, upstream of data modelling.

4. LIME Stability issue

Consider choosing a specific individual and performing LIME on it several times. Ideally, each call should return the same explanation.

Every time LIME is employed, it generates new data points, which follow the same distribution (law) but differ across applications. This is due to the random nature of the sampling. With different points, LIME may obtain divergent explainable models g, and thus different explanations, for the chosen individual.

Based on this evidence, we define the concept of LIME stability: explanations derived from repeated LIME calls, under the same conditions, are considered stable when they are statistically equal.

Figure 2. LIME explanations are not informative when applied to Machine Learning models with many independent variables, in this case a Gradient Boosting model using 100 features.

In Alvarez-Melis and Jaakkola (2018) the authors provide insight into LIME’s lack of robustness, a notion similar to the above-mentioned stability. Analogous findings are reported in Gosiewska and Biecek (2019).

Some approaches, grouped in two high level concepts, have been recently laid out in order to solve the stability issue.

Avoid the sampling step

In Zafar and Khan (2019) the authors propose to bypass the sampling step using the training units only and a combination of Hierarchical Clustering and K-Nearest Neighbour techniques. Although this method achieves stability, it may find a bad approximation of the ML function, in regions with only few training points.

Evaluate the post-hoc stability

The shared idea is to repeat the LIME method under the same conditions, and test whether the results are equivalent. Among the various proposals on how to conduct the test, in Shankaranarayana and Runje (2019) the authors compare the standard deviations of the Ridge coefficients, whereas Molnar (2020b) examines the stability of the feature selection step, i.e. whether the selected variables are the same.

Although we consider recent work on the topic headed in the right direction, we feel more work has to be done in order to provide solid grounds and mathematical rigour to the metrics evaluating LIME stability.

5. Our Proposition

Hereafter, we give a formal description of the framework used to evaluate LIME stability.

Consider the black-box model f built on P variables, a chosen point to be explained x and a fixed number of variables p, used in LIME’s explainable models. The Weighted Ridge Regression model g (LIME’s output) can be viewed as a mapping between the set of variables and the respective coefficients:

$$g: F \to \mathbb{R},$$

where F is the set of variables, of cardinality P. p out of the P features are associated with a value different from 0; the other variables have coefficient 0, meaning they are irrelevant to the model. The notation g(feat) indicates the coefficient value of the feature named feat in the model g.

We perform m different calls to LIME on the model f and the individual x, obtaining m different explainable models $g_1, \dots, g_m$.

We want to: (i) check whether the different g are composed of the same variables, (ii) compare the coefficients of the same variable among $g_1, \dots, g_m$ and test whether they can be considered equal.

To this purpose, we devise two complementary indices: the Variables Stability Index (VSI) and Coefficients Stability Index (CSI).

5.1. Variables Stability Index: VSI

The Variables Stability Index (VSI), whose steps are explained in Algorithm 1, addresses the first point, namely it compares the variables composition of the models.

We consider the set $C^m_2$ of all possible combinations of the m explainable models, two by two. The generic element of $C^m_2$ is the pair $(g_\alpha, g_\beta)$, on which we define

$$\text{concordance}(g_\alpha, g_\beta) = |V_\alpha \cap V_\beta|,$$

where $V_\alpha$ and $V_\beta$ represent respectively the variables used in the explainable models $g_\alpha$ and $g_\beta$. The concordance function returns an integer value, namely the cardinality of the intersection between $V_\alpha$ and $V_\beta$, ranging from 0 to p. It represents the number of variables used by both $g_\alpha$ and $g_\beta$.

We evaluate the concordance over all the pairs in $C^m_2$, divide each value by p and average them, obtaining the VSI index, ranging from 0 to 1. We express the index as a percentage: it now spans from 0 to 100, and the closer it gets to 100, the more the variables found in different applications are the same.
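A minimal sketch of the VSI computation follows. It assumes that each LIME call is summarised by the set of its p selected variable names and that each pairwise concordance is divided by p before averaging; the function name `vsi` and the feature names in the usage example are hypothetical.

```python
from itertools import combinations

def vsi(selected_features, p):
    """Variables Stability Index (0-100) from the variables selected in m LIME calls.

    selected_features: list of sets of feature names, one set per LIME call.
    p: number of variables retained by each explainable model.
    """
    pairs = list(combinations(selected_features, 2))
    if not pairs:
        raise ValueError("at least two LIME calls are needed")
    # Share of variables in common for every pair of explainable models.
    concordances = [len(set(a) & set(b)) / p for a, b in pairs]
    return 100.0 * sum(concordances) / len(pairs)

calls = [{"cbs", "unpaid"}, {"cbs", "income"}, {"cbs", "unpaid"}]
print(vsi(calls, p=2))
```

In this toy example the three calls agree fully on one pair and on half of the variables in the other two pairs, so the index lands at about 66.7.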

5.2. Coefficients Stability Index: CSI

The equality between coefficients of the different models is now under investigation. In the following, we derive the statistical distribution of the coefficients and we rely on it, to create confidence intervals and possibly statistical tests.

It is a well-known result (Greene, 2003), that under the classic assumptions of Linear Regression, the coefficients are guaranteed to follow a Gaussian distribution. This is not sufficient, since we deal with Weighted Ridge Regression.

In van Wieringen (2015), the distribution of the Ridge Regression estimator is given by the formula:

$$\hat{\beta}(\lambda) \sim \mathcal{N}\left( (X^\top X + \lambda I_{pp})^{-1} X^\top X \beta,\; \sigma^2 (X^\top X + \lambda I_{pp})^{-1} X^\top X \left[(X^\top X + \lambda I_{pp})^{-1}\right]^\top \right) \tag{3}$$

where X is the matrix of observations. In our setting, X is composed of the points randomly sampled inside LIME. The matrix $I_{pp}$ stands for the identity matrix (dimensions p × p), and $\sigma^2$ is the variance of the random variables describing the errors for each sampled point. Under the Regression assumptions the errors are independent and identically distributed (IID) following a Gaussian law: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. $\lambda$ stands for the Ridge regularisation coefficient. The vector $\beta$ represents the true values of the coefficients in the population, whereas $\hat{\beta}$ consists of the estimates of the true values, obtained using the X dataset.

In our setting, we may consider the values $\beta$ as the unknown coefficients of the best linear approximation of f in the neighbourhood of x. LIME aims to provide $\hat{\beta}$ as close as possible to the unknown $\beta$.

Concerning Weighted Regression, it is usually estimated via Generalised Least Squares (GLS), which guarantees the distribution of its estimators to be the following (see Johnston & DiNardo, 1972):

$$\hat{\beta}_W \sim \mathcal{N}\left(\beta,\; \sigma^2 (X^\top W X)^{-1}\right)$$

In the formula, W is the diagonal matrix of weights for each unit. In our setting, the W matrix is populated by the kernel weights calculated on the distance of each sampled point from x.

It is important to recall that $\sigma^2$ is an unknown value and we are required to obtain an unbiased estimator inferred from the data. Such an estimator takes the form:

$$\hat{\sigma}^2 = \frac{E^\top W E}{n - p}$$

for the Weighted Regression, as stated in Johnston and DiNardo (1972). E stands for the vector of the errors for each sampled point: $E = (\varepsilon_1, \dots, \varepsilon_n)$. As far as Ridge Regression is concerned, the variance estimator remains unchanged from the Linear Regression one (van Wieringen, 2015).

Using the building blocks stated before, we derive the distribution of the Weighted Ridge Regression estimator. Starting from the Ridge Regression law (Equation 3), we know (Billingsley, 2008) that the Gaussian distribution is preserved whenever we employ a matrix of known weights. This guarantees the Weighted Ridge law of the coefficients to be Gaussian. Its distribution is

$$\hat{\beta}_W(\lambda) \sim \mathcal{N}\left( (X^\top W X + \lambda I_{pp})^{-1} X^\top W X \beta,\; \sigma^2 (X^\top W X + \lambda I_{pp})^{-1} X^\top W X \left[(X^\top W X + \lambda I_{pp})^{-1}\right]^\top \right) \tag{5}$$

We also provide the formula for the variance estimator of the Weighted Ridge Regression

$$\hat{\sigma}^2 = \frac{E^\top W E}{n - p}$$

where n is the number of data points sampled inside LIME and p denotes the number of variables considered in the explainable model.
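As an illustration, the estimated covariance of the Weighted Ridge coefficients can be computed as sketched below. The sketch assumes the usual sandwich form for the covariance, $\hat{\sigma}^2 A\, X^\top W X\, A^\top$ with $A = (X^\top W X + \lambda I)^{-1}$, a weighted residual variance with n − p degrees of freedom, and features already standardised so that no intercept is fitted. The function name `weighted_ridge_fit` is hypothetical.

```python
import numpy as np

def weighted_ridge_fit(X, y, w, lam):
    """Weighted Ridge coefficients, their estimated covariance and sigma^2 estimate.

    X: (n, p) design matrix, y: (n,) responses, w: (n,) kernel weights,
    lam: Ridge regularisation coefficient.
    """
    n, p = X.shape
    XtW = X.T * w                                  # X' W for diagonal W
    gram = XtW @ X                                 # X' W X
    A = np.linalg.inv(gram + lam * np.eye(p))      # (X' W X + lam I)^-1
    beta_hat = A @ XtW @ y
    resid = y - X @ beta_hat
    sigma2_hat = np.sum(w * resid ** 2) / (n - p)  # E' W E / (n - p)
    cov = sigma2_hat * A @ gram @ A.T              # sandwich covariance
    return beta_hat, cov, sigma2_hat
```

The diagonal of `cov` holds the variances Var(g(feat)) of the individual coefficients, which is the quantity a confidence interval for each coefficient requires.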

Knowing the distribution of the coefficients, we might derive a test statistic to assess a null hypothesis of equality. This comparison can also be carried out among coefficients of two different regression models, as long as they were estimated on two independent samples drawn from the same law, as derived by Brame, Paternoster, Mazerolle, & Piquero (1998) for the coefficients of two distinct Linear Regressions.

This assumption holds true in our experimental design, since the data are sampled from the features’ distribution inferred from the original data. It means that the true generating distribution of $X_\alpha$ and $X_\beta$, i.e. the datasets sampled in repeated LIME calls, is identical, while the differences among them are attributable only to sampling variance.

Unfortunately, the simplifications carried out in Brame et al. (1998) and Greene (2003) in order to derive the t-test statistic rely on the equality of the expected values of the two coefficients under consideration. This is true in Linear Regression, but the framework breaks down when using a regularisation technique such as Ridge: the regulariser trades off the unbiasedness of the estimator in exchange for a possibly strong reduction of the variance.

Since the estimator is no longer unbiased, its expected value now depends on the design matrix X. As stated before, different LIME calls give rise to different design matrices; this implies that the expected values of a specific variable’s coefficient, taken from two different explainable models $g_\alpha$ and $g_\beta$, are different: $E[g_\alpha(feat)] \neq E[g_\beta(feat)]$. This result causes the derivation of the t-test statistic to break down.
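The dependence of the Ridge expected value on the design matrix can be checked numerically. The sketch below uses the standard expression $(X^\top X + \lambda I)^{-1} X^\top X \beta$ for the expected value of the (unweighted) Ridge estimator; the coefficient vector, sample sizes and the helper name `ridge_expected_value` are illustrative assumptions.

```python
import numpy as np

def ridge_expected_value(X, beta, lam):
    """Expected value of the Ridge estimator for design matrix X and true beta."""
    gram = X.T @ X
    return np.linalg.solve(gram + lam * np.eye(X.shape[1]), gram @ beta)

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])           # hypothetical true coefficients
X_a = rng.normal(size=(50, 2))         # sample drawn in a first LIME call
X_b = rng.normal(size=(50, 2))         # sample drawn in a second LIME call

print(ridge_expected_value(X_a, beta, lam=10.0))
print(ridge_expected_value(X_b, beta, lam=10.0))
print(ridge_expected_value(X_a, beta, lam=0.0))   # no penalty: recovers beta
```

With a positive penalty the two expected values differ although both samples come from the same law, whereas with no penalty both collapse to the true coefficients, which is the case the t-test derivation relies on.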

Testing the null hypothesis of equality has proven tricky and not easily solvable, hence we rely on the Gaussian distribution of the coefficients to construct 95% confidence intervals. To do that, we design the function ConfInt, taking as input a coefficient and giving back its confidence interval:

$$\text{ConfInt}(g(feat)) = \left[\, g(feat) - 1.96\sqrt{\text{Var}(g(feat))},\;\; g(feat) + 1.96\sqrt{\text{Var}(g(feat))} \,\right]$$

where Var(g(feat)) is calculated based on the distribution given in Equation 5.

We may consider the parameters to be different, within a 5% error rate, when the confidence intervals do not overlap at all. Instead, we consider them to be stable whenever the confidence intervals overlap to some extent.

To this purpose, we devise the binary function overlap, which takes as input a generic pair of confidence intervals $(c_\alpha, c_\beta)$ and returns 1 if $c_\alpha \cap c_\beta \neq \emptyset$ and 0 otherwise.
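The two building blocks can be sketched as follows, assuming a Gaussian 95% interval with the usual 1.96 quantile and treating intervals that merely touch at an endpoint as overlapping (a choice made here for illustration). The function names `conf_int` and `overlap` mirror the text but the implementation is only a sketch.

```python
import math

Z_95 = 1.96  # standard Normal quantile for a two-sided 95% interval

def conf_int(coef, variance, z=Z_95):
    """Gaussian confidence interval (low, high) for a coefficient."""
    half_width = z * math.sqrt(variance)
    return (coef - half_width, coef + half_width)

def overlap(ci_a, ci_b):
    """Return 1 if the two intervals share at least one point, 0 otherwise."""
    return 1 if ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1] else 0

first = conf_int(0.50, 0.01)    # hypothetical coefficient and variance, call 1
second = conf_int(0.65, 0.01)   # hypothetical coefficient and variance, call 2
print(first, second, overlap(first, second))
```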

The comparison among confidence intervals is carried out separately for each variable. Given a certain feature, we check across the explainable models whether the feature is relevant (coefficient different from 0). Whenever this happens, we build the confidence interval for the coefficient, using the function ConfInt, and we consider the set of all confidence intervals, namely M, for the chosen variable.

We create all the possible combinations of the M items, two by two. This results in the set $C^M_2$, whose generic element is $(c_\alpha, c_\beta)$. We calculate the overlap between the two intervals, using the overlap function, for all the pairs in $C^M_2$ and sum the results.

The outcome is a count variable, which we normalise by dividing by the cardinality of the set $C^M_2$. The value obtained ranges from 0 to 1 and is called the Partial Index (Par) for the variable considered. It represents a measure of concordance of the specific variable’s coefficients among different LIME calls.

To achieve a general concordance metric, we average the Partial Indices of all the features and obtain the Coefficients Stability Index (CSI), ranging from 0 to 1. Consider now the index as a percentage, rescaling it from 0 to 100: the more CSI approaches 100, the more LIME coefficients may be considered stable in the neighbourhood of the chosen individual.

CSI steps are detailed in Algorithm 2.
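A compact sketch of the CSI computation is given below. It assumes the confidence intervals have already been built for every call in which a feature was selected, counts touching intervals as overlapping, and skips features selected in a single call since they yield no pair to compare; these choices, the function names and the feature names in the example are assumptions made for illustration.

```python
from itertools import combinations

def intervals_overlap(ci_a, ci_b):
    """Return 1 if the two (low, high) intervals intersect, 0 otherwise."""
    return 1 if ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1] else 0

def csi(intervals_by_feature):
    """Coefficients Stability Index (0-100) and the Partial Index of each feature.

    intervals_by_feature: dict mapping a feature name to the list of confidence
    intervals obtained in the LIME calls where that feature was selected.
    """
    partial = {}
    for feat, intervals in intervals_by_feature.items():
        pairs = list(combinations(intervals, 2))
        if not pairs:
            continue  # a single interval gives no pair to compare
        hits = sum(intervals_overlap(a, b) for a, b in pairs)
        partial[feat] = hits / len(pairs)
    if not partial:
        raise ValueError("no feature was selected in at least two LIME calls")
    return 100.0 * sum(partial.values()) / len(partial), partial

example = {
    "cbs": [(0.90, 1.10), (1.00, 1.20), (0.95, 1.05)],
    "unpaid": [(0.00, 0.10), (0.50, 0.60), (0.05, 0.20)],
}
value, partial = csi(example)
print(round(value, 1), partial)
```

Returning the Partial Indices alongside the overall value makes it easy to see which variables drive a low CSI.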

5.3. Interpretation of the indices

The previously defined indices constitute a useful tool for assessing LIME stability in practical scenarios. By construction, VSI measures the concordance of the variables retrieved, whereas CSI tests the similarity among coefficients for the same variable, in repeated LIME calls.

Both of them range from 0 to 100. High VSI values indicate that the variables retrieved in different LIME calls are almost always the same. On the contrary, low values indicate that the explanations are not trustworthy: we may retrieve completely different variables explaining the same Machine Learning decision, depending on the LIME call.

As far as CSI is concerned, high values indicate that the LIME coefficient for each feature is reliable. Low values, instead, should make the practitioner very cautious: given a feature, the first LIME call will give back a certain value of the coefficient, but the next one is likely to retrieve a different value. Since the coefficient represents the impact of the feature on the Machine Learning decision, obtaining different values corresponds to very different explanations.

Each index has its own meaning and checks for a particular kind of stability. Achieving high values for both of them indicates stability; however, low values for only one metric are still possible. Keeping the measurements separate allows us to understand which of the two complementary definitions of stability has been violated by the trained LIME method.

6. Practical Application to Credit Risk Data

Credit Risk Modelling (CRM) consists in estimating the probability that a debtor will not repay the due amount. This task is regarded as a fully-fledged prediction task and as such, a variety of different learning techniques have been applied over the years in order to solve it.

Statistical approaches have long been exploited, forming the core techniques in this field. Recently, Machine Learning models have also been given a chance. They have usually shown an increase in predictive power, although they do not provide reliable explanations for the scores they come up with.

This is a particularly delicate issue in CRM, since it is a highly regulated field: GDPR (Kingston, 2017), as well as the “Ethical Guidelines for trustworthy AI” (HLEG, 2019) and the Report from the “European Banking Authority” (EBA, 2020) testify to the care dedicated to such topics by the European Community.

In order to exploit the Machine Learning potential in the Credit Scoring field, it is mandatory to address the interpretability issue. To do that, our proposal concerns applying LIME on top of a well performing black-box algorithm. By doing so, we wish to retain the increased predictive power of Machine Learning, while providing meaningful explanations to the applicants involved, as well as to the regulator.

To validate the above mentioned approach, we use a real-life dataset representative of a loan application process. It comes from an anonymised statistical sample, obtained by pooling data from several Italian financial institutions.

In it, there are several demographic, economic and financial variables used as predictors, whereas the response variable consists of only two categories (bad payer, good payer), framing the problem as classification.

The dataset composition is shown in Table 1.

Table 1. Dataset Composition

Before using learning techniques of any kind, we take care to select the important features among the many variables available in the dataset. This is done for two reasons: (i) classical models do not perform well in high-dimensional settings and (ii) LIME applied to high-dimensional Machine Learning models would fail. To this end, we retained only the 20 most important features, to be employed for learning purposes.

6.1. Logistic Regression model

On the dataset described above, we employ Logistic Regression along with a Machine Learning model, for the sake of comparison between the two techniques.

We consider Logistic Regression as the benchmark model, being explainable and widely used in Credit Scoring. The technique is well described in Agresti (2015) and it usually obtains satisfactory results on CRM datasets. Logistic Regression also provides interpretability out-of-the-box: the exponentiated parameters of the fitted curve can be read as odds ratios, where the odds are the ratio between the probability of default and of non-default, namely $P(\text{default}) / (1 - P(\text{default}))$.
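A short numerical check makes the odds-ratio reading concrete. The intercept and coefficient below are hypothetical values chosen for illustration, not estimates from the paper’s dataset, and the helper names are assumptions.

```python
import math

def default_probability(x, intercept, coef):
    """Logistic Regression probability of default for a single feature value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def odds(prob):
    """Odds of default: probability of default over probability of non-default."""
    return prob / (1.0 - prob)

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds when the feature increases by delta."""
    return math.exp(coef * delta)

intercept, coef = -2.0, 0.7   # hypothetical fitted parameters
ratio = odds(default_probability(3.0, intercept, coef)) / odds(
    default_probability(2.0, intercept, coef)
)
print(ratio, odds_ratio(coef))
```

Both printed values coincide: raising the feature by one unit multiplies the odds of default by the exponential of its coefficient, regardless of the starting point.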

6.2. Machine Learning Model

As for the choice of Machine Learning algorithm, we use tree-based models, specifically Gradient Boosting Trees (see Figure 3). They retain the enhanced predictive power of Machine Learning models, while having the additional advantage of requiring almost no pre-processing. Because of their structure, they are able to cope easily with outliers and extreme values.

Visani, Chesani, Bagli, Capuzzo, & Poluzzi (2019) present a comparison between Logistic Regression and Gradient Boosting Trees applied to CRM, along with an explanation for the increased predictive power of the latter.

Figure 3. Gradient Boosting Tree model.

The best Tree built at step k has parameters Θ chosen in order to minimise the Loss Function between the target variable Y and the Boosted model of the previous step.

The weight of the Tree, when added to the Boosted Ensemble, is also chosen with respect to the Loss Function.

6.3. Model comparison

The dataset has been divided into a Train set and a Test set: the former is employed to train the models, whereas the latter serves for comparison purposes. Gradient Boosting’s hyperparameters have been tuned by means of grid search and 10-fold cross validation on the training set. Logistic Regression, on the contrary, has no hyperparameters to tune and the fit has been carried out on the entire training set, exploiting the Newton-Raphson iterative procedure for convergence of the coefficients.

Performance comparison across the two models is done by means of the Gini Index on the Test set, considered the most reliable figure of merit for model performance in the CRM field (Hand, 2001). In Figure 4, the Lorenz Curve of the two models is displayed, along with the Gini values.
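For reference, the Gini Index of a scoring model can be computed as sketched below. The sketch assumes the definition commonly used in credit scoring, Gini = 2·AUC − 1, with the AUC obtained by comparing every positive and negative pair of scores and counting ties as one half; whether this matches the exact computation used for Figure 4 is an assumption. The function name `gini_index` is hypothetical.

```python
import numpy as np

def gini_index(y_true, scores):
    """Gini Index (2 * AUC - 1) of scores against binary labels (1 = positive class)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Pairwise comparison of every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))
    return 2.0 * auc - 1.0

print(gini_index([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking
print(gini_index([0, 1, 0, 1], [0.1, 0.2, 0.8, 0.9]))  # one misordered pair
```

The pairwise comparison uses memory proportional to the product of the two class sizes, so for large Test sets a rank-based AUC would be preferable.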

Using Gradient Boosting instead of Logistic Regression, it is possible to recognise an improvement in performance, testified by a Gini increase of more than 3 points. This result is consistent with the recent comparison between Statistical and Machine Learning models applied to the CRM field, carried out by Moscatelli, Narizzano, Parlapiano, & Viggiano (2019) at Banca d’Italia.

6.4. LIME applied to Credit Risk models

We test LIME on several data points, with the purpose of understanding the logic hidden in the Gradient Boosting model employed. In Table 2, we report LIME explanations for a “good” user, who has been correctly predicted by the GBM model. Different LIME settings are employed, to demonstrate how a wrong choice of the

It is worth noticing LIME results for the stable explanation make sense from an economic and financial standpoint: the key regressors are the Credit Bureau Score (CBS), namely a comprehensive value developed using information provided by the Italian Credit Bureau, and the number of months where unpaid instalments occurred, within the last year. The user exhibits 0 months with unpaid instalments and falls inside a good class of CBS index. Such circumstances are the major ones leading Gradient Boosting model to classify him as a good payer.

On the contrary, the unstable LIME method produces different regression lines for each call, making it very hard to trust them since for the same individual we end up with totally different explanations.

On a laptop with four Intel i7 2.90 GHz CPUs, the indices took 10.23 and 11.54 seconds to be calculated for the left and right settings of Table 2 respectively.

Table 2. LIME applied to Gradient Boosting model.

The sum of the bars’ values, along with the intercept, produces the Local Ridge model prediction (denoted as LIME Prediction). The bars’ length highlights the specific contribution of each variable: the green ones push the model towards the “good payer” prediction, whereas the red ones push it towards “bad payer”.

7. Discussion and conclusions

Often Machine Learning models produce more accurate predictions compared to classical models: there was evidence of such trend also in the use case presented above, regarding the CRM field. Credit Risk would therefore benefit from the employment of such more powerful techniques.

Regrettably, to date there is no methodology allowing unambiguous explanations of Machine Learning models: in recent years, a number of methods have been proposed, but their consistency and reliability are still under discussion.

We focus on the LIME technique and apply it successfully to Credit Risk data. Digging further into the method, we try to establish whether LIME is stable, namely whether repeated calls of the method on the same individual result in very close explanations.

To this end, we derive the distribution of the coefficients of the local model that LIME uses under the hood. Building on the coefficients’ distribution, we create a stability index (CSI), which evaluates whether the coefficients for the same variable are similar across different LIME calls. In parallel, we monitor whether the variables returned by different LIME calls are the same, using a second index (VSI). The two complementary indices both range from 0 to 100, where higher values correspond to a higher degree of stability.
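The idea behind the two indices can be illustrated with a simplified sketch. It is not the exact formulation used in the paper: it assumes that each LIME call is summarised as a mapping from variable name to a (coefficient, standard error) pair, that two coefficients are "similar" when their normal-approximation confidence intervals overlap, and that at least two calls are available. The function names `vsi_sketch` and `csi_sketch` and all numbers are hypothetical.

```python
from itertools import combinations


def vsi_sketch(runs):
    """Mean share of variables in common over all pairs of LIME calls, in [0, 100]."""
    shares = [
        len(set(a) & set(b)) / max(len(a), len(b))
        for a, b in combinations(runs, 2)
    ]
    return 100 * sum(shares) / len(shares)


def intervals_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def csi_sketch(runs, z=1.96):
    """Per variable, share of call pairs whose intervals overlap; averaged, in [0, 100]."""
    per_variable = []
    for var in set().union(*runs):
        # Confidence interval of this variable in every call that returned it.
        cis = [(c - z * se, c + z * se)
               for c, se in (run[var] for run in runs if var in run)]
        pairs = list(combinations(cis, 2))
        if pairs:  # variables seen in a single call cannot be compared
            per_variable.append(
                sum(intervals_overlap(a, b) for a, b in pairs) / len(pairs)
            )
    if not per_variable:
        return 0.0
    return 100 * sum(per_variable) / len(per_variable)


# Hypothetical output of three LIME calls on the same individual.
stable_runs = [
    {"cbs": (0.40, 0.02), "unpaid": (0.30, 0.02)},
    {"cbs": (0.41, 0.02), "unpaid": (0.29, 0.02)},
    {"cbs": (0.39, 0.02), "unpaid": (0.31, 0.02)},
]
unstable_runs = [
    {"cbs": (0.40, 0.02), "unpaid": (0.30, 0.02)},
    {"cbs": (0.10, 0.02), "income": (0.25, 0.02)},
    {"age": (0.20, 0.02), "unpaid": (0.05, 0.02)},
]
```

In this toy setting the stable calls select the same variables with overlapping intervals, so both sketched indices reach 100. The unstable calls share few variables and their coefficients disagree, so both indices drop sharply.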

An application to tabular data is shown in the paper, although the stability framework can also be applied to image and text data, as long as the LIME local model is chosen to be Ridge Regression. Indeed, the formulation of the indices relies on the properties of the Ridge model.

When used together, the two indices provide useful insights to the practitioner about the consistency of the trained LIME method: they help understand whether LIME is likely to modify its output at the next call.

We consider this an important step: it improves trust in LIME as a reliable explanation method and it goes towards meeting the regulator’s requests in the CRM field.

However, this result only ensures that LIME is consistent across repeated calls: the local model may still return explanations that are not really close to the Machine Learning model. More research is needed in this direction.

Acknowledgements

We would like to thank Professor Giuliano Galimberti, who provided insight and expertise that greatly assisted the research, although he may not agree with all of the interpretations/conclusions of this paper.

Funding

We acknowledge financial support by CRIF S.p.A. and Università degli Studi di Bologna.

References

Agresti, A. (2015). Foundations of linear and generalized linear models. John Wiley & Sons.

Alvarez-Melis, D., & Jaakkola, T. S. (2018). On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049.

Apley, D. W., & Zhu, J. (2016). Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468.

Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In International conference on database theory (pp. 217–235). Springer.

Billingsley, P. (2008). Probability and measure. John Wiley & Sons.

Brame, R., Paternoster, R., Mazerolle, P., & Piquero, A. (1998). Testing for the equality of maximum-likelihood regression coefficients between two independent equations. Journal of Quantitative Criminology, 14(3), 245–261.

Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32.

Craven, M., & Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In Advances in neural information processing systems (pp. 24–30).

EBA, E. B. A. (2020). Report on Big Data and Advanced Analytics.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of statistics, 1189–1232.

Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1), 44–65.

Gosiewska, A., & Biecek, P. (2019). IBreakDown: Uncertainty of model explanations for non-additive predictive models. arXiv preprint arXiv:1903.11420.

Greene, W. H. (2003). Econometric analysis. Pearson Education India.

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5), 93.

Hall, P., & Gill, N. (2018). An Introduction to Machine Learning Interpretability-Dataiku Version. O’Reilly Media, Incorporated.

Hand, D. J. (2001). Modelling consumer credit risk. IMA Journal of Management mathematics, 12(2), 139–155.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media.

HLEG, A. (2019). Ethics guidelines for trustworthy AI.

Hoerl, A. E., & Kennard, R. W. (1970, February). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67.

Johnston, J., & DiNardo, J. (1972). Econometric methods (Vol. 2). New York.

Kingston, J. (2017, December). Using artificial intelligence to support compliance with the general data protection regulation. Artificial Intelligence and Law, 25(4), 429–443.

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111.

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in neural information processing systems (pp. 4765–4774).

Molnar, C. (2020a). Interpretable machine learning. Lulu.com.

Molnar, C. (2020b). Limitations of Interpretable Machine Learning Methods.

Moscatelli, M., Narizzano, S., Parlapiano, F., & Viggiano, G. (2019). Corporate default forecasting with machine learning.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144). ACM.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Shankaranarayana, S. M., & Runje, D. (2019). ALIME: Autoencoder Based Approach for Local Interpretability. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 454–463). Springer.

van Wieringen, W. N. (2015). Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169.

Visani, G., Bagli, E., & Chesani, F. (2020). OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms. arXiv preprint arXiv:2006.05714.

Visani, G., Chesani, F., Bagli, E., Capuzzo, D., & Poluzzi, A. (2019). Explanations of Machine Learning predictions: A mandatory step for its application to Operational Processes.

Zafar, M. R., & Khan, N. M. (2019). DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems. arXiv preprint arXiv:1906.10263.

Zhou, Y., & Hooker, G. (2016). Interpreting models via single tree approximation. arXiv preprint arXiv:1610.09036.
