Unbiased variable importance for random forests

2020·Arxiv

ABSTRACT

ABSTRACT

The default variable-importance measure in random Forests, Gini importance, has been shown to suffer from the bias of the underlying Gini-gain splitting criterion. While the alternative permutation importance is generally accepted as a reliable measure of variable importance, it is also computationally demanding and suffers from other shortcomings. We propose a simple solution to the misleading/untrustworthy Gini importance which can be viewed as an overfitting problem: we compute the loss reduction on the out-of-bag instead of the in-bag training samples.

Variable importance is not very well defined as a concept. Even for the case of a linear model with n observations, p variables and the standard n >> p situation, there is no theoretically defined variable importance metric in the sense of a parametric quantity that a variable importance estimator should try to estimate (Gr¨omping, 2009). Variable importance measures for random forests have been receiving increased attention in bioinformatics, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. They also have been used as screening tools (D´ıaz-Uriarte and De Andres, 2006, Menze et al., 2009) in important applications highlighting the need for reliable and well-understood feature importance measures.

The default choice in most software implementations (Liaw and Wiener, 2002, Pedregosa

et al., 2011) of random forests (Breiman, 2001) is the mean decrease in impurity (MDI). The MDI of a feature is computed as a (weighted) mean of the individual trees’ improvement in the splitting criterion produced by each variable. A substantial shortcoming of this default measure is its evaluation on the in-bag samples which can lead to severe overfitting (Kim and Loh, 2001). It was also pointed out by Strobl et al. (2007a) that the variable importance

measures of Breiman’s original Random Forest method ... are not reliable in situations where

potential predictor variables vary in their scale of measurement or their number of categories.

There have been multiple attempts at correcting the well understood bias of the Gini impurity

measure both as a split criterion as well as a contributor to importance scores, each one coming from a different perspective.

Strobl et al. (2007b) derive the exact distribution of the maximally selected Gini gain along with their resulting p-values by means of a combinatorial approach. Shih and Tsai (2004) suggest a solution to the bias for the case of regression trees as well as binary classification trees (Shih, 2004) which is also based on p-values. Several authors (Loh and Shih, 1997, Hothorn et al., 2006) argue that the criterion for split variable and split point selection should be separated.

An idea that is gaining quite a bit of momentum is to add so-called pseudo variables to a dataset, which are permuted versions of the original variables and can be used to correct for bias (Sandri and Zuccolotto, 2008). Recently, a modified version of the Gini importance called Actual Impurity Reduction (AIR) was proposed Nembrini et al. (2018) that is faster than the original method proposed by Sandri and Zuccolotto with almost no overhead over the creation of the original RFs and available in the R package ranger (Wright and Ziegler, 2015, Wright et al., 2017). After submission of this article, the following two papers using OOB samples to compute a debiased version of the Gini importance (Li et al., 2019, Zhou and Hooker, 2019) came to the authors’ atttenion.

We use the well known titanic data set to illustrate the perils of putting too much faith into the Gini importance which is based entirely on training data - not on OOB samples -and makes no attempt to discount impurity decreases in deep trees that are pretty much

frivolous and will not survive in a validation set.

In the following model we include passengerID as a feature along with the more reasonable Age, Sex and Pclass. Figure 1 below show both measures of variable importance and (maybe?) surprisingly passengerID turns out to be ranked number 3 for the Gini importance (MDI). This troubling result is robust to random shuffling of the ID.

Figure 1: Mean decrease impurity (MDI, left panel) versus permutation importance (MDA,

The permutation based importance (MDA, right panel) is not fooled by the irrelevant ID feature. This is maybe not unexpected as the IDs should bear no predictive power for the out-of-bag samples.

There appears to be broad consenus that random forests rarely suffer from overfitting which plagues many other models. (We define overfitting as choosing a model flexibility which is too high for the data generating process at hand resulting in non-optimal performance on an independent test set.) By averaging many (hundreds) of separately grown deep trees -each of which inevitably overfits the data - one often achieves a favorable balance in the bias variance tradeoff. For similar reasons, the need for careful parameter tuning also seems less essential than in other models. We point out that random forests immunity to overfitting is restricted to the predictions only and not to the default variable importance measure.

2 OOB impurity reduction

As is standard practice in statistical modeling and machine learning optimization, any attempt to generalize performance or other metrics needs to be based on validation data that were not used in the training stage. We propose the mean decrease in loss evaluated on the OOB parts of the data as a simple alternative to the computationally expensive permutation importance. This approach (i) is in close analogy to MDI (which is computed on the training data), (ii) should be easily implemented in most software libraries and (iii) leads to no bias toward variables with high cardinality. As candidates for appropriate loss functions,

we explored log loss, misclassification rate and MDI but eventually adapted a penalized

Gini impurity which combines inbag and out-of-bag samples in a novel way. It is easy to see that just redefining the information gain based on the Gini impurities of the OOB samples

is not a good measure for the quality of a split as it ignores the correlation with the actual labels from the training data. It would be misleading to assign a low impurity to a node with e.g. ˆ9 if the predictions of the training data were drastically different, e.g. ˆ2. In such a case we would not want to boost the contribution of the variable which led to this split.

Our main idea is to increase the impurity I(m) for node m by a penalty that is proportional to the difference ∆ = (ˆ

The exact details of the actual loss functions depend on the following conceptual design choices:

• Symmetry: should one weigh train and test data equally or put more faith into the

• Maximum Uncertainty: should one allow the discrepany ∆ to increase the loss beyond

The answers to both questions also affect the likelihood of negative importance scores which may or may not be desired, an issue that we will reflect upon later. The following four will be evaluated in this paper:

The last two are clearly symmetric in ˆdoes not take into ac- count the inbag impurity of the node. Measures (1) and (2) “over-penalize”, i.e. their maximum values are 1 , while the last measure does not allow the ∆ term to in- crease the impurity beyond 0. We mention in passing that the terms inbag/OOB are synonymous with train/test subsets.

Our R package rfVarImpOOB (Loecher, 2019) computes (1) (3) for the current randomForest library (Liaw and Wiener, 2002). The code is written as an external wrapper and hence slow. While it would be straightforward to parallelize the pass over the individual trees, our hope is that the authors of (Wright and Ziegler, 2015, Wright et al., 2017, Liaw and Wiener, 2002, Pedregosa et al., 2011) would adapt these importance scores into the C code base.

Figure 2 compares the three measures for the same random forest model fitted to the Titanic data. While there is no objective ground truth to compare with, the left 2 panels assign effectively zero importance to PassengerId compared to which penalizes dif- ferences between train/test the least. The negative scores for be interpreted as a measure of extreme overfitting in the tree building process. Note that the valid split contributions of Age are overcompensated by the large number of splits that

Figure 2: Comparison of the three penalized (Gini) mean decrease impurities (PMDI) eqns.

do not hold up on the validation set. If negative importance scores are deemed difficult to communicate, it would be appropriate to truncate them at zero.

2.1 Application to C-to-U conversion data

As a second example, we compare importance scores for the Arabidopsis thaliana data (Cum- mings and Myers, 2004) which was also analyzed in (Strobl et al., 2007a). The sample (n = 876) contains the binary response (edit) and the 40 nucleotides at positions the codon position (cp), the estimated folding energy (fe) and the difference in estimated folding energy between preedited and edited sequences (dfe). Only the latter 2 predictors are continuous, the other 41 variables are categorical with 4 levels each. We add an uninformative predictor (sfe) by randomly shuffling the column fe. Figure 3 shows the standard importances for a random forest model fitted to the Arabidopsis data. Figure 4 compares the three PDMI measures and finds only positions 1 and 1 as strong predictors. while the importance of fe, dfe seem to vary moderately with the exact penalty chosen. The shuffled column sfe is assigned a notable importance score for the default impurity decrease (MDI) on the training data which all but disappears for the alternative measures.

Figure 3: Mean decrease impurity (MDI, left panel) versus permutation importance (MDA,

2.2 Simulated Data

We replicate the simulation design used by (Strobl et al., 2007a) where a binary response variable Y is predicted from a set of 5 predictor variables that vary in their scale of measurement and number of categories. The first predictor variable is con- tinuous, while the other predictor variables are multinomial with 2, 4, 10, 20 categories, respectively. The sample size for all simulation studies was set to n = 120. In the first null case all predictor variables and the response are sampled independently. We woud hope that a reasonable variable importance measure would not prefer any one predictor variable over any other. In the second simulation study, the so-called power case, the distribution of the response is a binomial process with probabilities that depend on the value of

As is evident in the left panel of Figure 5 the Gini importance shows a strong preference for variables with many categories and the continuous variable confirming its well-known

Figure 4: Comparison of the three penalized mean decrease impurities (PMDI) eqns.

bias. For the sake of brevity we omit the boxplot for the permutation importance which was already discussed in (Strobl et al., 2007a) and only compare eqns. (1) and (2). Encouragingly,

Figure 5: Results of the null case, where none of the predictor variables is informative.

both methods yield low scores for all predictors. A recurring pattern of method (1) are the relatively large negative scores that depend on the cardinality of the variables. As mentioned before, it would be straightforward to clip those to zero if one was to derive no value from learning about the degree of overfitting. More worrisome are the notable differences in the variance of the distributions for predictor variables with different scale of measurement or number of categories. The results from the power study are summarized in Figure 6. MDI again shows a strong bias towards variables with many categories and the continuous variable. At the chosen sigal-to-noise ratio it fails to identify the relevant predictor variable. In fact, the mean value for the relevant variable is lowest and only slightly higher than in the null case. Both methods (1) and (2) clearly succeed in identifying as the most relevant feature. While the large fluctuations of the importance scores for and especially are bound to yield moderate “false positive” rates and incorrect rankings in single trials. The negative scores of lead to a larger separation of the signal from noise and hence more reliable correct identification of the most important variables.

Figure 6: Results of the null case, where only is informative. Simulation details as in

2.3 Sample Variance Bias Correction

Recall that the Gini index is often viewed as the variance of a Bernoulli process

At the same time an unbiased estimator of the variance of any data set

which is simply the sample version of the Bernoulli variance. Figure 7 shows the results for the same simulations outlined in section 2.2. (We omitted

underperformance.) Several stark differences to Figures 5 and 6 are immediately obvious: (i)

its unbiased attribute and shows distorted rankings that depend on the number of categories.

2.4 Regression

While we leave the implementation of the regression case for a future version of the package, the idea of replacing the inbag loss (in this case MSE) with the OOB loss remains the same.

Figure 7: Simulation results for the modified . (top row) The null case, where none

The equivalents of Eqns. (1)-(3) will be even more straightforward as the penalty term becomes unnecessary if we replace ¯

3 Discussion

It seems somewhat surprising that the careful distinction between validation and training test has not been drawn as consistently for variable importance measures as for most other metrics in statistical modeling. From a machine learning perspective, one would want to base feature importance scores only on the performance of an appropriate loss function on a validation set. Misclassification rate as a function of ˆ) suffers from a discontinuity at ˆ5. Instead we advocate the modified Gini impurity measures defined in Eqns. (1)-(3), which encode some important differences in philosophy. For perfect agreement, ˆ, all three scores reduce to the conventional Gini impurity with values between [0will reach their maximum of 1 when the disagreement is the highest, ’s philosophy is such that no constellation can yield a loss higher than 0.5 which would reflect complete uncertainty. Hence, e.g. ˆis no more impure than ˆ

The crucial difference between (1) and (2) is symmetry: advocates an average impu- rity between train and test whereas weighs inbag much less directly. In particular, 5 if either ˆ5 and in that case is independent of the respective other proportion. We have further shown that are unbiased for uninformative features. In combination with the poor performance of

Summarizing, we view the well established bias in Gini importance measures as another manifestation of overfitting in machine learning models and propose a straightforward solution to this problem. We have demonstrated its effectiveness on 2 real and one simulated data set. While our solution is clearly amenable to any tree-based method, the bias is most pronounced for deep trees, hence the focus on Random Forests.

References

L. Breiman. Random forests. Machine Learning, 45, 2001. 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.

Michael P Cummings and Daniel S Myers. Simple statistical models predict c-to-u edited sites in plant mitochondrial rna. BMC Bioinformatics, 5(1):132, 2004.

Ram´on D´ıaz-Uriarte and Sara Alvarez De Andres. Gene selection and classification of mi- croarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.

Ulrike Gr¨omping. Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63(4):308–319, 2009.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A con- ditional inference framework. Journal of Computational and Graphical statistics, 15(3): 651–674, 2006.

Hyunjoong Kim and Wei-Yin Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589–604, 2001.

Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, and Bin Yu. A debiased mdi feature importance measure for random forests. arXiv preprint arXiv:1906.10845, 2019.

Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2 (3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.

Markus Loecher. rfVarImpOOB: Unbiased Variable Importance for Random Forests, 2019. URL https://CRAN.R-project.org/package=rfVarImpOOB. R package version 1.0.

Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees. Statistica sinica, pages 815–840, 1997.

Bjoern H Menze, B Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich, and Fred A Hamprecht. A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10(1):213, 2009.

Stefano Nembrini, Inke R K¨onig, and Marvin N Wright. The revival of the gini importance? Bioinformatics, 34(21):3711–3718, 2018.

Fabian Pedregosa, Ga¨el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct): 2825–2830, 2011.

Marco Sandri and Paola Zuccolotto. A bias correction algorithm for the gini variable impor- tance measure in classification trees. Journal of Computational and Graphical Statistics, 17(3):611–628, 2008.

Y-S Shih. A note on split selection bias in classification trees. Computational statistics & data analysis, 45(3):457–466, 2004.

Yu-Shan Shih and Hsin-Wen Tsai. Variable selection bias in regression trees with constant fits. Computational statistics & data analysis, 45(3):595–607, 2004.

C. Strobl, A. L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 2007a. 10.1186/1471-2105-8-25. URL https://doi.org/10.1186/1471-2105-8-25.

Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the gini index. Computational Statistics & Data Analysis, 52 (1):483–501, 2007b.

Marvin N Wright and Andreas Ziegler. ranger: A fast implementation of random forests for high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409, 2015.

Marvin N Wright, Stefan Wager, Philipp Probst, and Maintainer Marvin N Wright. Package ranger. 2017.

Zhengze Zhou and Giles Hooker. Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179, 2019.

4 Appendix

4.1 Expected Values

The decrease in impurity (∆G) for a parent node m is the weighted difference between the Gini importance) and those of its left and right children:

We assume that the node m splits on an independent.

We will use the short notation either equal to oob or in and rely on the following facts and notation:

1. is the “population” proportion of the class label in the OOB test

2. is the “population” proportion of the class label in the inbag test data

3.

4.

Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of

4.1.1 E(∆PG(0)oob) ̸= 0

We use the shorter notation

We see that there is a bias if we used only OOB data, which becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that the splitting on uninformative variables happens most frequently for “deeper” nodes.

4.1.2 E(∆�PG

The above bias is due to the well known bias in variance estimation, which can be eliminated with the bias correction (5), as outlined in the main text. We now show that the bias for

this modified Gini impurity is zero for OOB data. As before,

4.1.3 E(∆PG(2)oob) = 0

We can rewrite as follows:

E[∆PG