As is standard practice in statistical modeling and machine learning optimization, any attempt to generalize performance or other metrics needs to be based on validation data that were not used in the training stage. We propose the mean decrease in loss evaluated on the OOB parts of the data as a simple alternative to the computationally expensive permutation importance. This approach (i) is in close analogy to MDI (which is computed on the training data), (ii) should be easily implemented in most software libraries and (iii) leads to no bias toward variables with high cardinality. As candidates for appropriate loss functions,
we explored log loss, misclassification rate and MDI but eventually adapted a penalized
Gini impurity which combines inbag and out-of-bag samples in a novel way. It is easy to see that just redefining the information gain based on the Gini impurities of the OOB samples
is not a good measure for the quality of a split as it ignores the correlation with the actual labels from the training data. It would be misleading to assign a low impurity to a node with e.g. ˆ9 if the predictions of the training data were drastically different, e.g. ˆ2. In such a case we would not want to boost the contribution of the variable which led to this split.
Our main idea is to increase the impurity I(m) for node m by a penalty that is proportional to the difference ∆ = (ˆ
The exact details of the actual loss functions depend on the following conceptual design choices:
• Symmetry: should one weigh train and test data equally or put more faith into the
• Maximum Uncertainty: should one allow the discrepany ∆ to increase the loss beyond
The answers to both questions also affect the likelihood of negative importance scores which may or may not be desired, an issue that we will reflect upon later. The following four will be evaluated in this paper:
The last two are clearly symmetric in ˆdoes not take into ac- count the inbag impurity of the node. Measures (1) and (2) “over-penalize”, i.e. their maximum values are 1 , while the last measure does not allow the ∆ term to in- crease the impurity beyond 0. We mention in passing that the terms inbag/OOB are synonymous with train/test subsets.
Our R package rfVarImpOOB (Loecher, 2019) computes (1) (3) for the current randomForest library (Liaw and Wiener, 2002). The code is written as an external wrapper and hence slow. While it would be straightforward to parallelize the pass over the individual trees, our hope is that the authors of (Wright and Ziegler, 2015, Wright et al., 2017, Liaw and Wiener, 2002, Pedregosa et al., 2011) would adapt these importance scores into the C code base.
Figure 2 compares the three measures for the same random forest model fitted to the Titanic data. While there is no objective ground truth to compare with, the left 2 panels assign effectively zero importance to PassengerId compared to which penalizes dif- ferences between train/test the least. The negative scores for be interpreted as a measure of extreme overfitting in the tree building process. Note that the valid split contributions of Age are overcompensated by the large number of splits that
Figure 2: Comparison of the three penalized (Gini) mean decrease impurities (PMDI) eqns.
do not hold up on the validation set. If negative importance scores are deemed difficult to communicate, it would be appropriate to truncate them at zero.
2.1 Application to C-to-U conversion data
As a second example, we compare importance scores for the Arabidopsis thaliana data (Cum- mings and Myers, 2004) which was also analyzed in (Strobl et al., 2007a). The sample (n = 876) contains the binary response (edit) and the 40 nucleotides at positions the codon position (cp), the estimated folding energy (fe) and the difference in estimated folding energy between preedited and edited sequences (dfe). Only the latter 2 predictors are continuous, the other 41 variables are categorical with 4 levels each. We add an uninformative predictor (sfe) by randomly shuffling the column fe. Figure 3 shows the standard importances for a random forest model fitted to the Arabidopsis data. Figure 4 compares the three PDMI measures and finds only positions 1 and 1 as strong predictors. while the importance of fe, dfe seem to vary moderately with the exact penalty chosen. The shuffled column sfe is assigned a notable importance score for the default impurity decrease (MDI) on the training data which all but disappears for the alternative measures.
Figure 3: Mean decrease impurity (MDI, left panel) versus permutation importance (MDA,
2.2 Simulated Data
We replicate the simulation design used by (Strobl et al., 2007a) where a binary response variable Y is predicted from a set of 5 predictor variables that vary in their scale of measurement and number of categories. The first predictor variable is con- tinuous, while the other predictor variables are multinomial with 2, 4, 10, 20 categories, respectively. The sample size for all simulation studies was set to n = 120. In the first null case all predictor variables and the response are sampled independently. We woud hope that a reasonable variable importance measure would not prefer any one predictor variable over any other. In the second simulation study, the so-called power case, the distribution of the response is a binomial process with probabilities that depend on the value of
As is evident in the left panel of Figure 5 the Gini importance shows a strong preference for variables with many categories and the continuous variable confirming its well-known
Figure 4: Comparison of the three penalized mean decrease impurities (PMDI) eqns.
bias. For the sake of brevity we omit the boxplot for the permutation importance which was already discussed in (Strobl et al., 2007a) and only compare eqns. (1) and (2). Encouragingly,
Figure 5: Results of the null case, where none of the predictor variables is informative.
both methods yield low scores for all predictors. A recurring pattern of method (1) are the relatively large negative scores that depend on the cardinality of the variables. As mentioned before, it would be straightforward to clip those to zero if one was to derive no value from learning about the degree of overfitting. More worrisome are the notable differences in the variance of the distributions for predictor variables with different scale of measurement or number of categories. The results from the power study are summarized in Figure 6. MDI again shows a strong bias towards variables with many categories and the continuous variable. At the chosen sigal-to-noise ratio it fails to identify the relevant predictor variable. In fact, the mean value for the relevant variable is lowest and only slightly higher than in the null case. Both methods (1) and (2) clearly succeed in identifying as the most relevant feature. While the large fluctuations of the importance scores for and especially are bound to yield moderate “false positive” rates and incorrect rankings in single trials. The negative scores of lead to a larger separation of the signal from noise and hence more reliable correct identification of the most important variables.
Figure 6: Results of the null case, where only is informative. Simulation details as in
2.3 Sample Variance Bias Correction
Recall that the Gini index is often viewed as the variance of a Bernoulli process
At the same time an unbiased estimator of the variance of any data set
which is simply the sample version of the Bernoulli variance. Figure 7 shows the results for the same simulations outlined in section 2.2. (We omitted
underperformance.) Several stark differences to Figures 5 and 6 are immediately obvious: (i)
its unbiased attribute and shows distorted rankings that depend on the number of categories.
2.4 Regression
While we leave the implementation of the regression case for a future version of the package, the idea of replacing the inbag loss (in this case MSE) with the OOB loss remains the same.
Figure 7: Simulation results for the modified . (top row) The null case, where none
The equivalents of Eqns. (1)-(3) will be even more straightforward as the penalty term becomes unnecessary if we replace ¯
It seems somewhat surprising that the careful distinction between validation and training test has not been drawn as consistently for variable importance measures as for most other metrics in statistical modeling. From a machine learning perspective, one would want to base feature importance scores only on the performance of an appropriate loss function on a validation set. Misclassification rate as a function of ˆ) suffers from a discontinuity at ˆ5. Instead we advocate the modified Gini impurity measures defined in Eqns. (1)-(3), which encode some important differences in philosophy. For perfect agreement, ˆ, all three scores reduce to the conventional Gini impurity with values between [0will reach their maximum of 1 when the disagreement is the highest, ’s philosophy is such that no constellation can yield a loss higher than 0.5 which would reflect complete uncertainty. Hence, e.g. ˆis no more impure than ˆ
The crucial difference between (1) and (2) is symmetry: advocates an average impu- rity between train and test whereas weighs inbag much less directly. In particular, 5 if either ˆ5 and in that case is independent of the respective other proportion. We have further shown that are unbiased for uninformative features. In combination with the poor performance of
Summarizing, we view the well established bias in Gini importance measures as another manifestation of overfitting in machine learning models and propose a straightforward solution to this problem. We have demonstrated its effectiveness on 2 real and one simulated data set. While our solution is clearly amenable to any tree-based method, the bias is most pronounced for deep trees, hence the focus on Random Forests.
L. Breiman. Random forests. Machine Learning, 45, 2001. 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
Michael P Cummings and Daniel S Myers. Simple statistical models predict c-to-u edited sites in plant mitochondrial rna. BMC Bioinformatics, 5(1):132, 2004.
Ram´on D´ıaz-Uriarte and Sara Alvarez De Andres. Gene selection and classification of mi- croarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
Ulrike Gr¨omping. Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63(4):308–319, 2009.
Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A con- ditional inference framework. Journal of Computational and Graphical statistics, 15(3): 651–674, 2006.
Hyunjoong Kim and Wei-Yin Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589–604, 2001.
Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, and Bin Yu. A debiased mdi feature importance measure for random forests. arXiv preprint arXiv:1906.10845, 2019.
Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2 (3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.
Markus Loecher. rfVarImpOOB: Unbiased Variable Importance for Random Forests, 2019. URL https://CRAN.R-project.org/package=rfVarImpOOB. R package version 1.0.
Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees. Statistica sinica, pages 815–840, 1997.
Bjoern H Menze, B Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich, and Fred A Hamprecht. A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10(1):213, 2009.
Stefano Nembrini, Inke R K¨onig, and Marvin N Wright. The revival of the gini importance? Bioinformatics, 34(21):3711–3718, 2018.
Fabian Pedregosa, Ga¨el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct): 2825–2830, 2011.
Marco Sandri and Paola Zuccolotto. A bias correction algorithm for the gini variable impor- tance measure in classification trees. Journal of Computational and Graphical Statistics, 17(3):611–628, 2008.
Y-S Shih. A note on split selection bias in classification trees. Computational statistics & data analysis, 45(3):457–466, 2004.
Yu-Shan Shih and Hsin-Wen Tsai. Variable selection bias in regression trees with constant fits. Computational statistics & data analysis, 45(3):595–607, 2004.
C. Strobl, A. L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 2007a. 10.1186/1471-2105-8-25. URL https://doi.org/10.1186/1471-2105-8-25.
Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the gini index. Computational Statistics & Data Analysis, 52 (1):483–501, 2007b.
Marvin N Wright and Andreas Ziegler. ranger: A fast implementation of random forests for high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409, 2015.
Marvin N Wright, Stefan Wager, Philipp Probst, and Maintainer Marvin N Wright. Package ranger. 2017.
Zhengze Zhou and Giles Hooker. Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179, 2019.
4.1 Expected Values
The decrease in impurity (∆G) for a parent node m is the weighted difference between the Gini importance) and those of its left and right children:
We assume that the node m splits on an independent.
We will use the short notation either equal to oob or in and rely on the following facts and notation:
1. is the “population” proportion of the class label in the OOB test
2. is the “population” proportion of the class label in the inbag test data
3.
4.
Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of
4.1.1 E(∆PG(0)oob) ̸= 0
We use the shorter notation
We see that there is a bias if we used only OOB data, which becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that the splitting on uninformative variables happens most frequently for “deeper” nodes.
4.1.2 E(∆�PG
The above bias is due to the well known bias in variance estimation, which can be eliminated with the bias correction (5), as outlined in the main text. We now show that the bias for
this modified Gini impurity is zero for OOB data. As before,
4.1.3 E(∆PG(2)oob) = 0
We can rewrite as follows:
E[∆PG