b

DiscoverSearch
About
My stuff
Trees, forests, and impurity-based variable importance
2020·arXiv
Abstract
Abstract

Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their ability to detect sparse signals and their resulting good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require to understand the phenomena underlying the data, which is accessible only with an in-depth comprehension of the algorithm prediction process. Unfortunately, random forests are not intrinsically interpretable since their prediction results from averaging several hundreds of decision trees. A classic approach to gain knowledge on this so-called black-box algorithm is to compute variable importances, that are employed to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Mean Decrease Impurity (MDI) is one of the two variable importance measures in random forests. However, there is no theoretical justification to use MDI: we do not even know what this indicator estimates. In this paper, we analyze MDI and prove that if input variables are independent and in absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interaction, for which the variable importance is intrinsically ill-defined.

image

Decision trees [see Loh, 2011, for an overview] can be used to solve pattern recognition tasks in an interpretable way: in order to build a prediction, trees ask to each observation a series of questions, each one being of the form “Is variable  X(j) largerthan a threshold  s?” where j, sare to be determined by the algorithm. Thus the prediction for a new observation only depends on the sequence of questions/answers for this observation. One of the most popular decision trees is the CART procedure [Classification And Regression Trees, Breiman et al., 1984], which unfortunately suffers from low accuracy and intrinsic instability: modifying slightly one observation in the training set can change the entire tree structure together with the predicted values in some areas of the input space.

To overcome this last issue, and also improve their accuracy, ensemble methods, which grow many base learners and aggregate them to predict, have been designed. Random forests, introduced by Breiman [2001], are among the most famous tree ensemble methods. They proceed by growing trees based on CART procedure [Breiman et al., 1984] and randomizing both the training set and the splitting variables in each node of the tree. Breiman’s [2001] random forests have been under active investigation during the last two decades mainly because of their good practical performance and their ability to handle high-dimensional data sets. They are acknowledged to be state-of-the-art methods in fields such as genomics [Qi, 2012] and pattern recognition [Rogez et al., 2008], just to name a few. If empirical performances of random forests are not to be demonstrated anymore [Fern´andez-Delgado et al., 2014], their main flaw relies on their lack of interpretability because their predictions result from averaging over a large number of fully-grown trees (typically several hundreds): each tree may be interpreted, due to its recursive structure, but random forests remain an obscure black-box procedure.

Since interpretability is a concept difficult to define precisely, people eager to gain insights about the driving forces at work behind random forests predictions often focus on variable importance, a measure of the influence of each input variable to predict the output. In Breiman’s [2001] original random forests, there exist two importance measures: the Mean Decrease Impurity [MDI, or Gini importance, see Breiman, 2002], which sums up the gain associated to all splits performed along a given variable; and the Mean Decrease Accuracy [MDA, or permutation importance, see Breiman, 2001] which shuffles entries of a specific variable in the test data set and computes the difference between the error on the permuted test set and the original test set. Because of its very definition, MDI is an importance measure that can be computed for trees only, since it strongly relies on the tree structure, whereas MDA is an instantiation of the permutation importance that can be used for any predictive model. Both measures are used in practice even if they possess several major drawbacks.

MDI is known to favor variables with many categories [see, e.g., Strobl et al., 2007, Nicodemus, 2011]. Even when variables have the same number of categories, MDI exhibits empirical bias towards variables that possess a category having a high frequency [Nicodemus, 2011, Boulesteix et al., 2011]. MDI is also biased in presence of correlated features [Nicodemus and Malley, 2009]. A promising way to assess variable importance in random forest consists in designing new tree building process or new feature importances as in the R package party [Strobl et al., 2008, 2009, R Core Team, 2013] or in Li et al. [2019], Zhou and Hooker [2021]. On the other hand, MDA seems to exhibit less bias than MDI but tends to overestimate correlated features Strobl et al. [2008]. Besides, its scale version [the default version in the R package randomForest, Liaw and Wiener, 2002] depends on the sample size and on the number of trees, the last being an arbitrary chosen parameter [Strobl and Zeileis, 2008]. The interested reader may refer to Genuer et al. [2008] and Genuer et al. [2010] for an extensive simulation study about the influence of the number of observations, variables, and trees on MDA together with the impact of correlation on this importance measure. Despite all their shortcomings, one great advantage of MDI and MDA is their ability to take into account interaction between features even if unable to determine the part of marginal/joint effect [Wright et al., 2016].

From a theoretical perspective, there are only but a few results on variable importance computed via tree procedures. The starting point for theory is by Ishwaran [2007] who studies a modified version of permutation importance and derives its asymptotic positive expression, when the regression function is assumed to be piecewise constant. A scaled population version of the permutation importance is studied by Zhu et al. [2012] who establish rate of convergence for the probability that a relevant feature is chosen in a modified random forest algorithm. The exact population expression of permutation importance is computed by Gregorutti et al. [2017] for a Gaussian linear model.

Regarding the MDI, there are even fewer results. Louppe et al. [2013] study the population MDI when all features are categorical. In this framework, they propose a decomposition of the total information, depending on the MDI of each variable and of interaction terms. They also prove that the MDI of an irrelevant variable is zero and that adding irrelevant variables does not impact the MDI of relevant ones. Even if these results may appear to be very simple, the proofs are not straightforward at all, which explains the few results on this topic. This work was later extended by Sutera et al. [2016] to the case of context-dependent features.

Whereas recent works tackled the question of (in)consistency of the MDA [Ramosaj and Pauly, 2019, B´enard et al., 2021], there exists no general result establishing the consistency of MDI toward population quantities when computed with the original random forest algorithms: all existing results about MDI focus on modified random forests version with, in some cases, strong assumptions on the regression model. Therefore, there are no guarantees that using impurity-based variable importance computed via random forests is suitable to select variables, which is nevertheless often done in practice.

Agenda. In this paper, we focus on the regression framework and derive theoretical properties about the MDI used in Breiman’s [2001] random forests. Section 2 is devoted to notations and describes tree and forest construction, together with MDI calculation. The main results are gathered in Section 3. We prove that both population MDI and empirical MDI, computed with Breiman’s random forests, can be seen as a percentage of explained variance and are directly related to the quadratic risk of the forest, a result already proved by Louppe et al. [2013] and Sutera et al. [2016] for the population MDI in presence of categorical features. We also derive the expression of the sum of MDIs for two groups of variables that are independent and do not interact, which is at the core of our analysis. The analysis highlights that fully-developed random forests (whose tree leaves contain exactly one observation) produce inaccurate variable importances and should not be used for that purpose. In Section 4, we establish that for additive regression models with independent inputs, the MDI of each variable takes a very simple, interpretable form, which is the variance of each univariate component mj(X(j)). We prove that empirical MDI computed with Breiman’s forest targets this population quantity, which is nothing but a standard way to decompose the variance of the output. Hence, in the absence of input dependence and correlation, MDI computed with random forests provides a good assessment of the variable importance. To the best of our knowledge, it is the first result proving the good properties of the empirical MDI computed with Breiman’s forests. To move beyond the additive models and understand the impact of interactions, we study a multiplicative model in Section 5. In this model, MDI is ill-defined and one needs to aggregate many trees to obtain a consistent MDI value. Averaging trees is known to be of great importance to improve accuracy but it turns out to be mandatory to obtain consistent variable importance measure in the presence of interactions. The impact of dependence between input variables is studied in Section 6. We stress out that correlated variables are more likely to be chosen in the splitting procedure and, as a consequence, exhibit a larger MDI than independent ones. As for the presence of interactions, this highlights the importance of computing MDI by averaging many trees. Proofs are postponed to Section 7. Experiments on MDI can be found on the ArXiv version of this paper.

We assume to be given a data set  Dn = {(Xi, Yi), i = 1, . . . , n} of i.i.d.observations (Xi, Yi) distributed as the generic pair (X, Y ) where X ∈ [0, 1]d and Y ∈ R withE[Y 2] < ∞. Our aim is to estimate the regression function  m : [0, 1]d → R, defined as m(X) = E[Y |X], using a random forest procedure and to identify relevant variables, i.e. variables on which m depends.

Remark 1. Note that we confine the analysis to the regression setting with (bounded) continuous input variables. Extending our work to the classification setting is not straightforward, since our main result relies on the additive regression framework which does not have any equivalent in classification. Assuming that all input variables are continuous is a mild and common hypothesis in the random forest literature [see, e.g., Arlot and Genuer, 2014, Wager and Walther, 2015, Zhou and Hooker, 2021, just to name a few] which drastically eases the mathematical analysis.

2.1 CART construction and empirical MDI

CART [Breiman et al., 1984] is the elementary component of random forests [Breiman, 2001]. CART construction works as follows. Each node of a single tree T is associated with a hyper-rectangular cell included in [0, 1]d. The root of the tree is [0, 1]d itself and, at each step of the tree construction, a node (or equivalently its corresponding cell) is split in two parts. The terminal nodes (or leaves), taken together, form a partition of [0, 1]d. An example of such tree and partition is depicted in Figure 1, whereas the full procedure is described in Algorithm 1.

We define now the CART-split criterion. To this aim, we let  A ⊂ [0, 1]d be ageneric cell and  Nn(A) be the number of data points falling into A. A split in A is a pair (j, z), where j is a dimension in {1, . . . , d} and z is the position of the cut along the j-th coordinate, within the limits of  A. We let CAbe the set of all such possible cuts in A. Then, for any (j, z) ∈ CA, the CART-split criterion [Breiman et al., 1984] takes the form

image

image

Figure 1: A decision tree of depth k = 2 in dimension d = 2 (right) and the corresponding partition (left).

image

where  AL = {x ∈ A : x(j) < z}, AR = {x ∈ A : x(j) ≥ z}, and ¯YA (resp., ¯YAL, ¯YAR) isthe average of the  Yi’s belonging to  A (resp., AL, AR), with the convention 0/0 = 0. At each cell A, the best split (jn,A, zn,A) is finally selected by maximizing  Ln,A(j, z)over  CA, that is

image

To remove ties in the argmax, the best cut is always performed along the best cut direction  jn,A, at the middle of two consecutive data points. Once the tree has been built, the tree prediction  mn(x) for a new query point x is computed as the average of the  Yifalling into the same leaf as x.

The CART-split criterion  Ln,A(jn,A, zn,A) can be rewritten as the difference in impurity between the cell A and its two children, where the impurity is simply defined as the variance of the output in regression [see also Chapter 4 in Breiman et al., 1984]. The impurity is thus related to the splitting criterion and can be used to assess variable importances via MDI. More precisely, the Mean Decrease in Impurity (MDI) for the

variable  X(j) computed via a tree T is defined by

image

where the sum ranges over all cells A in T that are split along variable  j and pn,A isthe fraction of observations falling into A. In other words, the MDI of  X(j) computesthe weighted decrease in impurity related to splits along the variable  X(j).

2.2 Theoretical CART construction and population MDI

In order to study the (empirical) MDI defined in equation (3), we first need to define and analyze the population version of MDI. First, we define a theoretical CART tree, as in Algorithm 1, except for the splitting criterion which is replaced by its population version. Namely, for all cells A and all splits (j, z) ∈ CA, the population version of the CART-split criterion is defined as

image

Therefore in each cell of a theoretical tree  T ⋆, the best split (j⋆A, z⋆A) is chosen by maximizing the population version of the CART-split criterion that is

image

Of course, in practice, we cannot build a theoretical tree  T ⋆ since it relies on the true distribution (X, Y ) which is unknown. A theoretical tree is just an abstract mathematical object, for which we prove properties that will be later extended to the (empirical) CART tree, our true object of interest.

As for the empirical MDI defined above, the population MDI for the variable  X(j)computed via the theoretical tree  T ⋆ is defined as

image

where all empirical quantities have been replaced by their population version and the theoretical CART tree is used instead of the empirical CART tree defined in Section 2.1. We also let  p⋆A = P[X ∈ A].

Random forest A random forest is nothing but a collection of several decision trees whose construction has been randomized. In Breiman’s forest, CART procedure serves as base learner and the randomization is performed in two different ways:

1. Prior to the construction of each tree, a sample of the original data set is randomly drawn. Only this sample is used to build the tree. Sampling is typically done using bootstrap, by randomly drawing n observations out of the original n observations with replacement.

2. Additionally, for each cell, the best split is not selected along all possible variables. Instead, for each cell, a subsample of mtry variables is randomly selected without replacement. The best split is selected only along these variables. By default, mtry = d/3 in regression.

Random forest prediction is then simply the average of the predictions of such randomized CART trees. Similarly, the MDI computed via a forest is nothing but the average of the MDI computed via each tree of the forest.

In the sequel, we will use the theoretical random forest (forest that aggregate theoretical CART trees) and the population MDI to derive results on the empirical random forest (forest that aggregate empirical CART trees; the one widely used in practice) and the empirical MDI.

For any theoretical CART tree  T ⋆, and for any  k ∈ N, we denote by  T ⋆k the truncation of  T ⋆ at level k, that is the subtree of  T ⋆ rooted at [0, 1]d, in which each leaf has been cut at most k times. Let  AT ⋆k (x) be the cell of the tree  T ⋆k containing x. For anyfunction  f : [0, 1]d → Rand any cell  A ⊂ [0, 1]d, let

image

be the variance of the function f in the cell A with respect to the distribution of X. Proposition 1 states that the population MDI computed via theoretical CART trees can be used to decompose the variance of the output, up to some residual term which depends on the risk of the theoretical tree estimate.

Proposition 1. Assume that  Y = m(X) + ε, where εis a noise satisfying  E[ε|X] = 0and  V[ε|X] = σ2 almost surely, for some constant  σ2. Consider a theoretical CART tree  T ⋆. Then, for all  k ≥ 0,

image

where X′ is an i.i.d. copy of X.

Note that few assumptions are made in Proposition 1, which makes it very general. In addition, one can see that the sum of population MDI is close to the  R2 measureused to assess the quality of regression model. The population  R2 is defined as

image

Hence, the sum of MDI divided by the variance of the output corresponds to the percentage of variance explained by the model, which is exactly the population  R2.Therefore, Proposition 1 draws a clear connection between MDI and the very classical R2 measure.

We say that a theoretical tree  T ⋆ is consistent if limk→∞ ∆(m, AT ⋆k (X′)) = 0. If atheoretical tree is consistent, then its population  R2 tends to V[m(X)]/V[Y] as shown in Corollary 1 below.

Corollary 1. Grant assumptions of Proposition 1. Additionally, assume that  ∥m∥∞ <∞and, almost surely, limk→∞ ∆(m, AT ⋆k (X′)) = 0. Then, for all  γ > 0, there exists K such that, for all k > K,

image

Consistency (and, in this case, tree consistency) is a prerequisite when dealing with algorithm intepretability. Indeed, it is hopeless to think that we can leverage information about the true data distribution from an inconsistent algorithm, intrinsically incapable of modelling these data. Therefore, we assume in this section that theoretical trees are consistent and study afterwards variable importances produced by such algorithm. This strategy has been also adopted in Ramosaj and Pauly [2019], who assume the consistency of the (infinite) empirical random forests [Assumption A5, page 6; see also Assumption A2, page 9 in B´enard et al., 2021]. In the next sections, we will be able to prove tree consistency for specific regression models, allowing us to remove the generic consistency assumption in our results, as the one in Corollary 1.

Corollary 1 states that if the theoretical tree  T ⋆k is consistent, then the sum of the MDI of all variables tends to the total amount of information of the model, that is V[m(X)]. This gives a nice interpretation of MDI as a way to decompose the total variance V[m(X)] across variables. Louppe et al. [2013] prove a similar result when all features are categorical. This result was later extended by Sutera et al. [2016] in the case of context-dependent features. In their case, trees are finite since there exists only a finite number of variables with a finite number of categories. The major difference with our analysis is that trees we study can be grown to arbitrary depth, since we consider continuous input variables, along which an infinite number of splits can be performed. We thus need to control the error of a tree stopped at some level k, which is exactly the right-hand term in equation (5).

According to Corollary 1, the sum of the MDI is the same for all consistent theoretical trees: even if there may exist many theoretical trees, the sum of MDI computed for each theoretical tree tends to the same value, regardless of the considered tree. Therefore, all consistent theoretical tree produce the same asymptotic value for the sum of population MDI.

In practice, one cannot build a theoretical CART tree or compute the population MDI. Proposition 2 below is the extension of Proposition 1 for the empirical CART procedure. Let  Tnbe the (empirical) CART tree built with data set  Dnand let, for all k, Tn,kbe the truncation of  Tn at level k. Denote by �V[Y] the empirical variance of Y computed on the data set  Dn. Define, for any function  f : [0, 1]d �→ R, the empirical quadratic risk via

image

Proposition 2 (Empirical version of Proposition 1). Let  Tnbe the CART tree, based on the data set  Dn. Then,

image

where  fTnis the estimate associated to  Tn, as defined in Algorithm 1.

Note that equations (5) and (6) in Proposition 1 and Proposition 2 are valid for general tree construction. The main argument of the proof is simply a telescopic sum of the MDI which links the variance of Y in the root node of the tree to the variance of Y in each terminal node. These equalities hold for general impurity measures by providing a relation between root impurity and leaves impurity. As in Proposition 1, Proposition 2 proves that the  R2 of a tree can be written as

image

which allows us to see the sum of MDI as the percentage of variance explained by the model. This is particularly informative about the quality of the tree modelling when the depth k of the tree is fixed.

Trees are likely to overfit when they are fully grown. Indeed, the risk of trees which contain only one observation per leaf is exactly zero. Hence, according to equation (6), we have

image

and the  R2 of such tree is equal to one. Such  R2 does not mean that trees have good generalization error but that they have enough approximation capacity to overfit the data. Additionally, for a fully grown tree  Tn, we have

image

For fully grown trees, the sum of MDI contains not only all available information V[m(X)] but also the noise in the data. This implies that the MDI of some variables are higher than expected due to the noise in the data. The noise, when having a larger variance than the regression function, can induce a very important bias in the MDI by overestimating the importance of some variables. The bias of MDI has already been empirically studied. To circumvent the overfitting problem of MDI computations (using a single data set to build the trees and estimate the variable importances), Zhou and Hooker [2021] use a test set to compute MDI that are less biased. A quantification of the MDI bias has been proposed recently in Li et al. [2019] in which they show that deep trees tend to exhibit larger MDI bias for irrelevant variables. Their analysis provides finite-sample bounds for MDI of irrelevant variables whereas our analysis is asymptotic for both relevant and irrelevant variables.

The above discussion stresses out that, when interested by interpreting a decision tree, one must never use the MDI measures output by a fully grown tree. However, if the depth of the tree is fixed and large enough, the MDI output by the tree provides a good estimation of the population MDI as shown in Theorem 1.

Model 1. The regression model satisfies  Y = m(X) + ε where mis continuous, X admits a density bounded from above and below by some positive constants and  ε is anindependent Gaussian noise of variance  σ2.

Theorem 1. Assume Model 1 holds and that for all theoretical CART tree  T ⋆, almostsurely,

image

Let  Tnbe the CART tree based on the data set  Dn. Then, for all  γ > 0, ρ ∈ (0, 1],there exists  K ∈ N⋆ such that, for all k > K, for all n large enough, with probability at least 1  − ρ,

image

As for Corollary 1, Theorem 1 relies on the consistency of theoretical trees, which is essential to obtain positive results on tree interpretability. It is worth noticing that Theorem 1 is not a straightforward extension of Corollary 1. The proof is based on the uniform consistency of theoretical trees and combines several recent results on tree partitions to establish the consistency of empirical CART tree.

It is not possible in general to make explicit the contribution of each MDI to the sum. In other words, it is not easy to find an explicit expression of the MDI of each variable. This is due to interactions and correlation between inputs, which make difficult to distinguish the impact of each variable. Nevertheless, when the regression model can be decomposed into a sum of two independent terms, we can obtain more precise information on MDI. To this aim, let, for any set  J = {j1, . . . , jJ} ⊂ {1, . . . , d},X(J ) = (X(j1), . . . , X(jJ)).

Model 2. There exists  J ⊂ {1, . . . , d} such that X(J ) and X(J c) are independent and

image

where  mJ and mJ care continuous functions,  εis a Gaussian noise  N(0, σ2) inde-pendent of X and X admits a density bounded from above and below by some positive constants.

Lemma 1. Assume Model 2 holds. Then, for all  j ∈ J, the criterion  L⋆A(j, s) doesnot depend on  mJ cand is equal to  L⋆A(J )(j, s). Besides, any split  j ∈ Jleaves the variance of  mJ cunchanged.

Lemma 1 will be used to prove Proposition 3 below but is informative on its own. It states that if the regression model can be decomposed as in Model 2, the best split along coordinates in J does not depend on  mJ cor on the ranges of coordinates in  J c.

Proposition 3. Assume Model 2 holds. Let  T ⋆J be a tree built on distribution (X(J ),mJ (X(J ))) and let T ⋆J ,k be the truncation of the tree at level k. Similarly, let  T ⋆J c bea tree built on distribution (X(J c), mJ c(X(J c))) and let T ⋆J c,k be the truncation of the tree at level k. If for any tree  T ⋆J and T ⋆J c, almost surely,

image

then, for any tree  T ⋆ built on the original distribution (X, Y ), almost surely,

image

Proposition 3 states that the consistency of any theoretical trees results from the consistency of all theoretical trees built on subsets (J and J c) of the original set of variables, under Model 2. This is of particular significance for proving the consistency of theoretical tree for a new model which can be rewritten as a sum of independent regression functions: we can then deduce the tree consistency from existing consistency results for each regression function. Note that there are no structural assumptions on the functions  mJ and mJ capart from the continuity. In particular, the class of functions in Model 2 is wider than additive functions. Note also that importance measures via random forests have been designed for groups of variables. For instance, Gregorutti et al. [2015] extended the MDA to groups of variables and studied it theoretically in an additive regression framework. More recently, a tree algorithm tailored for groups of variables have been proposed by Poterie et al. [2019].

Theorem 2 below establishes the limiting value of the sum of MDI in subgroups J and  J c, both for the population version and the empirical version of the MDI, as soon as all theoretical trees are consistent.

Theorem 2. Under the assumptions of Proposition 3, for all  γ > 0, there exists K such that, for all k > K,

image

Besides, for all  γ > 0, ρ ∈ (0, 1], there exists K such that, for all k > K, for all n large enough, with probability at least 1  − ρ,

image

According to Theorem 2, the limiting value of the MDI for subgroup J (resp. subgroup  J c) is equal to the variance of the corresponding component of the regression function, that is  V[mJ (X(J ))] (resp. V[mJ c(X(J c))]). We will use Theorem 2 in the Section 4 to derive explicit expressions for the MDI of each variable, assuming a specific form of the regression function.

To pursue our analysis, we consider in this section a specific type of regression models: additive regression functions with independents inputs.

Model 3 (Additive model). The regression model writes

image

where each  mjis continuous;  εis a Gaussian noise  N(0, σ2), independent of X; and X  ∼ U([0, 1]d).

Additive regression models were popularized by Stone [1985] [see also Hastie and Tibshirani, 2017]. They were used as a generic framework for studying random forests in some previous works Klusowski [2020], Scornet et al. [2015], notably due to their simple structure and their absence of interactions. Indeed, due to the model additivity, there is no interaction between variables: the contribution of each variable for predicting the output is reduced to its marginal contribution. Since we also assume that features are independent, the information contained in one variable cannot be inferred by the values of other variables.

By considering a model with no interaction and with independent inputs, we know that the contribution of a variable has an intrinsic definition which does not depend on other variables that are used to build the model. In Model 3, the explained variance of the model V[m(X)] takes the form

image

which holds only because of independent inputs and the absence of interactions. The variance explained by the jth variable is defined unambiguously, and independently of which variables are considered in the model, as  V[mj(X(j))]. Therefore any importance measure for  X(j) which aims at decomposing the explained variance must output V[mj(X(j))]. It turns out that the MDI computed via CART trees converges to this quantity.

Theorem 3 (Additive model). Assume that Model 3 holds. Let  T ⋆ be any theoretical CART tree and  Tnbe the empirical CART tree. Then, for all  γ > 0, there exists K such that, for all k > K, for all j,

image

Moreover, for all  γ > 0, ρ ∈ (0, 1], there exists K such that, for all k > K, for all n large enough, with probability at least 1  − ρ, for all j,

image

Since the MDI computed via random forests is nothing but the average of MDI computed by trees, Theorem 3 also holds for MDI computed via random forests. To the best of our knowledge, Theorem 3 is the first result highlighting that empirical MDI computed with CART procedure converge to reasonable values that can be used for variable selection, in the framework of Model 3.

Contrary to Theorem 2, Theorem 3 does not assume the consistency of the theoretical tree. Indeed, in the context of additive models, one can directly take advantage of Scornet et al. [2015] to prove the consistency of theoretical CART trees.

Surprisingly, the population version of MDI has the same expression as the population version of MDA (up to factor 2), as stated in Gregorutti et al. [2017]. Thus, in the particular context of additive models with independent features, both MDI and MDA target the same quantity, which is the natural way to decompose the total variance. In this context, MDI and MDA can then be employed to rank features and select variables based on the importance values they produce.

Note that MDI is computed with truncated trees, i.e. trees that contain a large number of observations in each node. This is mandatory to ensure that the variance of the output in the resulting cell is correctly estimated. As mentioned before, considering fully grown trees would result in positively-biased MDI, which can lead to unintended consequences for variable selection. We therefore stress that MDI must not be computed using fully grown trees.

Theorem 3 proves that MDI computed with any CART tree are asymptotically identical: the MDI are consistent across trees. The only interest to use many trees to compute the MDI instead of a single one relies on the variance reduction property of ensemble methods, which allows us to reduce the variance of the MDI estimate when using many trees.

Remark: invariance under monotonous transformations Tree-based methods are known to be invariant by strictly monotonous transformations of each input. Letting  fjbeing any monotonous transformation applied to variable  X(j), the newregression model writes

image

where  fj(X(j)) are the new input variables. Assuming that Theorem 3 can be applied in the more general setting where  X(j) ∼ fj(U([0,1])), the MDI for this new problem satisfies

image

Therefore the asymptotic MDI is invariant by monotonic transformation, assuming that Theorem 3 holds for any marginal distribution of X (with independent components).

The most famous instance of additive regression models is the well-studied linear model [see, e.g., Rencher and Schaalje, 2008, Searle and Gruber, 2016].

Model 4 (Linear model). The regression model writes

image

where, for all  j, αj ∈ R; εis a Gaussian noise  N(0, σ2), independent of X; and X  ∼ U([0, 1]d).

As for Model 3, there is only one way to decompose the explained variance for linear models with independent input, as stated by Nathans et al. [2012] and Gr¨omping [2015] (and the references therein), which is  α2jV[X(j)]. Theorem 4below proves that MDI converge to these quantities and provides information about the theoretical tree structure in Model 4.

Theorem 4 (Linear model). Assume that Model 4 holds. For any cell  A = �dℓ=1[aℓ, bℓ] ⊂[0, 1], the coordinate which maximizes the population CART-split criterion (4) satisfies

image

and the splitting position is the center of the cell; the associated variance reduction is

image

Let  T ⋆ be any theoretical CART tree and  Tnbe the empirical CART tree. Then, for all  γ > 0, there exists K such that, for all k > K and for all j,

image

Moreover, for all  γ > 0, ρ ∈ (0, 1], there exists K such that, for all k > K, for all n large enough, with probability at least 1  − ρ, for all j,

image

Theorem 4 establishes that, for linear models with independent inputs, the MDI computed with CART trees whose depth is limited is exactly the quantity of interest, which depends only on the magnitude of  α2j, since all variables have the same variance. Theorem 4 is a direct consequence of Theorem 3. However, for linear models, one can compute explicitly the best splitting location and the associated variance decreasing, which is stated in Theorem 4. Being able to compute the variance decreasing for each split allows us to compute the MDI directly, without using Theorem 3. Note that the splitting location and variance reduction has been first stated by Biau [2012] and then by Ishwaran [2013].

Remark 2. All our results focus on the asymptotic behaviour of the MDI computed with CART. A natural next step in our analysis would be to derive rates of convergence for MDI, in order to quantify the accuracy of a variable ranking based on MDI. Unfortunately, we are limited by the (asymptotic) tree consistency assumption (see assumption in Corollary 1). If we had access to the rate of consistency of a tree, we could leverage on that to establish upper bounds on the rate of consistency of MDI. Although interesting, this line of work seems difficult. However, some generic arguments based on partition shapes have been used to establish rate of consistency for groups of MDI [MDI of irrelevant variables, see Theorem 1 in Li et al., 2019]. Interesting finite-sample lower bounds on MDI can also be found in Klusowski [2019].

In this section, we studied additive (linear) models and proved that, for these models with independent inputs, MDI computed with truncated CART is consistent. This result is based on CART consistency proved in Scornet et al. [2015] for additive models with independent inputs. One can naturally wonder what happens if the additivity assumption (or the independence assumption) is removed. In these cases, the general approach, developed in this section, is unfortunately not valid. The next two sections focus on two specific settings, and show that, when one of these two assumptions is removed, the MDI is ill-defined.

In the absence of interaction and dependence between inputs, the MDI is a measure of variable importance which is independent of the set of variables included in the model, as stated in Theorem 3. However, this is not true as soon as we consider classes of models that present interactions or dependence. In this section, we study the influence of interactions via the following model.

Model 5 (Multiplicative model). Let  α ∈ R. The regression model writes

image

where  α ∈ R; εis a Gaussian noise  N(0, σ2), independent of X; and X  ∼ U([0, 1]d).

Theorem 5 (Multiplicative model). Assume that Model 5 holds. For any cell A = �dℓ=1[aℓ, bℓ] ⊂ [0, 1], the coordinate which maximizes the population CART-split crite- rion (4) satisfies

image

and the splitting position is the center of the cell; the associated variance reduction is

image

Let  T ⋆ be any theoretical CART tree and  Tnbe the empirical CART tree. Then, for all  γ > 0, there exists K such that, for all k > K,

image

Moreover, for all  γ > 0, ρ ∈ (0, 1], there exists K such that, for all k > K, for all n large enough, with probability at least 1  − ρ,

image

Theorem 5 gives the explicit splitting position for a model with interactions. Deriving splitting positions allows us to prove that the theoretical tree is consistent which, in turns, proves that the sum of MDI converges to the explained variance V[m(X)], according to Theorem 1.

We are also interested in obtaining the exact MDI expression for each variable, as established for additive models. However, Theorem 2 no longer applies since the regression model cannot be decomposed into independent additive components. It turns out to be impossible to derive explicit expression for each MDI in this model. To see this, note that there exist many theoretical trees in Model 5. Two of them are displayed in Figure 2. They result in the same partition but the first split can be made either along  X(1) (Figure 2, left) or X(2) (Figure 2, right). Surprisingly, the variable importance computed with these trees is larger for the variable that is split in the second step. This exemplifies that splits occurring in the first levels of trees do not lead to the largest decrease in variance, which in turn, shows that variable importances based on the frequency of splitting along a variable in the first level(s) of a tree are inaccurate. Nonetheless, this holds only when interactions are present: in the case of the additive models defined in Section 4, the decrease in variance is stronger in the first tree levels. A direct consequence of this fact is that two different theoretical tree can output two different variable importance, as shown by Lemma 2 below.

image

Figure 2: Two theoretical tree partitions of level k = 2: the first split is performed along  X(1) (resp. X(2)) for the tree on the left (resp. on the right).

Lemma 2. Assume that Model 5 holds. Then, there exists two theoretical trees  T1 and

image

According to Lemma 2, there exist many different theoretical trees for Model 5 and the MDI may vary when computed with different theoretical trees. This is a major difference with additive models for which each theoretical tree asymptotically output the same MDI. Since MDI are usually used to rank and select variables, the fact that each tree can output a different MDI, and therefore a different variable ranking, is a major drawback for employing this measure in presence of interactions. One way to circumvent this issue could be to compute the MDI via a random forest: the randomization of splitting variables yields different trees and the aggregation of MDI across trees provides a mean effect of the variable importance. For example, in Model 5, one can hope to obtain importance measure that are equal since Model 5 is symmetric in all variables. This is impossible with only one tree but is a reachable goal by computing the mean of MDI across many trees as done in random forests. More generally, for simple regression problems, a single tree may be sufficient to predict accurately but many shallow trees are needed to obtain accurate MDI values.

Our previous results relies on the (often unrealistic) assumption that input variables are uniformly distributed over [0, 1]d, and thus independent. To gain insights on the impact of input dependence on MDI, we study a very specific model in which input features are not independent. We will prove that even in this very simple model, MDI has some severe drawbacks that need to be addressed carefully. Accordingly, we consider the simple case where the input vector X  = (X(1), X(2)) is of dimension two and has a distribution, parametrized by  β ∈ N, defined as

image

The distribution of X is uniform on 2β squares on the diagonal. Examples of such distribution are displayed in Figure 3. For such distribution, the correlation between

image

Figure 3: Illustration of distributions for X, with parameter  β = 1 (left) and β = 2(right).

X(1) and X(2) is given by Lemma 3.

Lemma 3. Let X  = (X(1), X(2)) ∼ U⊗2β as defined in (9). Then

image

For  β = 0, the distribution of Xis uniform over [0, 1]2 and accordingly, the correlation is null between the two components of X, as stated in Lemma 3. When  β(i.e. the number of squares) increases, that is the size of each square decreases, the distribution concentrates on the line  X(1) = X(2). Therefore, when  βtends to infinity, the correlation should tend to 1, which is proved in Lemma 3.

The distribution defined in (9) is rather unusual, and one can wonder why not considering Gaussian distributions instead. This choice is related to the thresholding nature of decision trees. Since we want to compute explicitly the splitting criterion along each coordinate, we must have a closed expression for the truncated marginals (restriction to any rectangle of [0, 1]2), which is not possible with Gaussian distribution. The distribution defined in (9) allows us to compute easily both truncated marginals and the joint distribution.

Model 6. Let  β ∈ N. Assume that

image

where (X(1), X(2)) ∼ U⊗2β, X(3) ∼ U([0, 1])is independent of (X(1), X(2)), and ε is anindependent noise distributed as  N(0, σ2).

In the framework of Model 6, Proposition 4 below states that the CART-split criterion has an explicit expression and highlights that splits along positively correlated variables (X(1) and X(2)) are more likely to occur compared to splits along independent ones (X(3)). Even if the model is very simple, it is the first theoretical proof that CART splitting procedure tends to favor positively correlated variables. Note however that considering negatively correlated variables will result in an opposite effect, i.e. a tendency to favour independent variables compared to negatively correlated ones.

Proposition 4. Assume that Model 6 holds. Then, the following statements hold:

(i) The splitting criterion has an explicit expression. For  β = 0, . . . , 5, each split is performed at the center of the support of the chosen variable.

image

Statement (ii) in Proposition 4 proves that a positive correlation between two variables increases the probability to split along one of these two variables. Indeed, in our particular model, the marginal effect of  X(3) on the output must be at least twice the effect of  X(1) or X(2) in order to split along  X(3).It is likely that the way the correlation impact the variable selection in trees depends both on the sign of the correlation and on the signs of coefficients in the linear model. Nevertheless, Proposition 4 proves that input dependence has an influence on variable selection in CART procedures. Theorem 6 provides the limiting MDI values given by theoretical and empirical CART trees, in Model 6.

Theorem 6. Let  β ∈ {0, . . . , 5}. Assume that Model 6 holds. Let  T ⋆ be any theoretical CART tree and  Tnbe the empirical CART tree. Then, for all  γ > 0, there exists K such that, for all k > K,

image

Additionally, for all  γ > 0, ρ ∈ (0, 1], there exists K such that, for all k > K, for all n large enough, with probability at least 1  − ρ,

image

Theorem 6 gives the limiting values for the MDI. Unfortunately, since  X(1) andX(2) are correlated, it is only possible to have information on the sum of the two MDI of  X(1) and X(2) rather than on each one of them. According to Lemma 3, a simple calculation shows that the sum of importances of  X(1) and X(2) satisfies,

image

Therefore, the group of variables  X(1) and X(2) has a larger importance because of their positive correlation. In this case, the amount of information provided by the two variables is larger than that provided by each one of them. This is a very important difference compared to the independent additive case, in which the sum of contributions of each variable is equal to the contributions of all variables. Here, even without interaction effects, this property does not hold, which leads to larger MDI for variables that are positively correlated. As mentioned before, the opposite effect would hold if the variables had a negative correlation or if they had coefficients of opposite signs in the linear model. The impact of correlation has been thoroughly investigated empirically for MDI Nicodemus and Malley [2009] and MDA Strobl et al. [2008] but we have theoretically shown above under Model 6 that correlation changes the MDI.

Even if the limiting value MDI⋆T ⋆K(X(1)) + MDI⋆T ⋆K(X(2)) is known, we have no in- formation on how this quantity is split between variable  X(1) and X(2) inside the tree T ⋆K. Indeed, we can find two theoretical trees which produce very different MDI as stated in Lemma 4.

Lemma 4. Let  β ∈ {0, . . . , 5}. Assume that Model 6 holds. Then, there exists two theoretical trees  T1 and T2 such that

image

Even in the case of a simple linear model with correlated input, it is not wise to use the importance of a single tree to measure the impact of this variable on the output, since this measure can vary drastically between two different trees. Instead, one should rather use an average of this measure across many shallow trees, hoping that randomizing the eligible variables for splitting ensure enough tree diversity in the forest, which in turn, leads to more balanced variable importances.

The proof of Lemma 4 relies on exhibiting a tree whose early splits are made along X(1) only, until X is uniformly distributed in each cell. For this tree, the importance of  X(1) is larger than the importance of  X(2) Since X(1) and X(2) have symmetric roles, one can also build a tree whose early splits are made along  X(2) only. The difference of importance for  X(1) between these two trees can be computed exactly, thanks to statement (i) and (ii) in Proposition 4.

Acknowledgements. We greatly thank the associate editor and the two referees for their careful reading and numerous insightful comments that strongly helped to improve the quality and the readability of the manuscript.

Proof of Proposition 1. Let  T ⋆ be a theoretical CART tree and define, for all  k ∈ N,the variance  ukof the tree  T ⋆k (the truncation of T at level k) as

image

where  ncell,kstands for the number of terminal nodes in  T ⋆k and {Aℓ,k, ℓ = 1, . . . , ncell,k}is the set of terminal nodes in  T ⋆k . Let fk(x) = V[Y |X′ ∈ Ak(x)]. By definition of the impurity measure, note that, for all k, the quantity  uk − uk+1corresponds to the reduction of impurity between level k and level k+1. Therefore, the impurity measure rewrites, for any  K ∈ N as

image

where  u0 = V[Y] by definition. Additionally,

image

yielding the desired conclusion

image

Proof of Corollary 1. Using notations of the proof of Proposition 1, for all x  ∈ [0, 1]d,fk(x) ≤ 4∥m∥2∞ + σ2and, by assumption, almost surely, limk→∞ fk(X) = σ2. Accord-ing to the dominated convergence theorem,

image

Therefore,

image

Proof of Proposition 2. Equation (11) in the proof of Proposition 1, written with empirical quantities, leads directly to the first statement.

Proof of Theorem 1.

Uniform convergence of theoretical trees. Note that a tree is nothing but a sequence of splits. Therefore, any tree T can be seen as a sequence (vn) such that • (v1, . . . , vd) represents the first split of the tree defined, for a split occurring along variable j at position s, as

image

• Each block (vmd+1, . . . , v(m+1)d) represents the mth split of the tree occurring along variable j at position s defined as

image

• Splits are ordered arbitrarily.

• In order to obtain a complete tree, we associate with all nodes whose construction is stopped, the split (0, . . . , 0). Since the split position belongs to [0, 1], we have �∞ℓ=1 v2ℓ < ∞. Besides,

image

For any tree T represented by the sequence (vn), we have

image

with  B(0, π2/6) being a compact of  ℓ2(N). Now, let  γ > 0 and

image

Set  A∞,γ = ∩∞k=0Ak,γ. Since the population CART-split criterion  L⋆, defined in (4) is non negative, we have

image

Therefore, for all  k, Ak+1,γ ⊂ Ak,γ. Because of the uniform continuity of the splitting criterion (continuous on a compact), for all  k ∈ N, the sets Ak,γare closed.

Assume that for all  k, Ak,γ ̸= ∅. We know that  B(0, π2/6) is compact and that, for all  k, Ak,γis a closed subset of  B(0, π2/6). Then, according to Cantor’s intersection theorem  A∞,γ ̸= ∅. Consequently, there exists  T ⋆ ∈ A∞,γ. By definition of  A∞,γ, forall k,

image

Again, by definition of  A∞,γ, T ⋆ is a theoretical tree, which is consistent by assumption. Consequently, there exists K such that, for all k > K,

image

which is absurd, according to equation (13). Therefore, there exists K > 0 such that AK,γ = ∅. All in all, we proved that, for all  γ >0, there exists K > 0 such that, for all theoretical tree  T ⋆, for all k > K,

image

Contribution of cells with small measure. Fix  γ and take Ksuch that inequality (14) holds. For any theoretical tree  T ⋆K of level K, the MDI associated with this tree is defined as

image

where the sum ranges for any node A in the tree,  µ(A) is the measure of the cell A defined as

image

where f is the density of X and  L⋆A the value of the population CART-split criterion at the best split of the cell A. Similarly, we let

image

where  Tn,Kis the empirical CART tree truncated at level K, A a node of the tree, µn(A) the empirical measure of  A and Ln,Athe empirical splitting criterion in cell A. Let MDI⋆T ⋆K,γ and �MDITn,K,γbe computed by removing the variance (empirical or population version) associated to cells  A for which µ(A) < γ. Define also MDI⋆Tn,K,γas the population version of the MDI but computed with the empirical partition  Tn,Kwhere the cells  A for which µ(A) < γhave been removed. Now, we want to control

image

Since X admits a density on [0, 1]d, the population CART-split criterion varies continuously as a function of the cell on which it is computed. Thus, the third term in (15) is controlled as soon as the empirical partition is close to the theoretical partition. Lemma 3 in Scornet et al. [2015] can be easily extended for any continuous regression function m and any density which is bounded from above and below by some positive constants. Therefore, with probability 1  − ρ, for all nlarge enough,

image

Last term in (15) can be treated directly by noting that the theoretical criterion is bounded above by 4∥m∥2∞. Therefore, we have

image

To control the second term, we make use of the results of Wager and Walther [2015], in particular Corollary 11 which can be adapted to a Gaussian noise by noting that with probability 1  − ρ, max1≤i≤n |εi| ≤ Cρ log n. Note that our setting is simpler than that of Wager and Walther [2015] where the dimension d grows to infinity with n. Therefore, with probability at least 1  − ρ, for all nlarge enough,

image

Regarding the first term in (15), for any cell  A such that µ(A) < γ,

image

With probability 1−ρ, for all nlarge enough, max1≤i≤n |εi| ≤ Cρ log n and, if nµn(A) ≥ √n,

image

Therefore, with probability 1  − ρ, for all nlarge enough,

image

Injecting inequalities (16)-(19) into (15), with probability 1−ρ, for all nlarge enough,

image

Since,

image

we conclude, from (14), (20) and Corollary 1, with probability 1  − ρ, for all n largeenough,

image

Since  ρ and γare arbitrary, this concludes the proof.

image

Proof of Lemma 1. Let  A = �dj=1[aj, bj] ⊂ [0, 1]dbe a generic cell. Recall that the regression model satisfies

image

For all  j ∈ J, and all splitting position s, define the two resulting cells

image

where, for notation brevity, the dependence on j and s is omitted. The variance reduction associated to any split (j, s) is defined as

image

Since the split is performed along the jth coordinate,

image

Consequently, the variance reduction satisfies

image

Since X(J ) and X(J ) are independent, letting  A(J ) = �j∈J [aj, bj] be the projection of the cell A along coordinates in J , we have

image

Therefore, for all  j ∈ J, the criterion  L⋆A(j, s) does not depend on  mJ cand is equal to its restriction over the cell  A(J ). Besides, according to equation (22), any split  j ∈ Jdoes not change the variance of  mJ c.

image

Proof of Proposition 3. Let x  ∈ [0, 1]d such that

(i) any theoretical tree  T ⋆k,J built on the distribution (X(J ), mJ (X(J ))) is consistent, that is lim k→∞ ∆(mJ , A(J )T ⋆k,J (x(J ))) = 0.

(ii) any theoretical tree  T ⋆k,J cbuilt on the distribution (X(J c), mJ c(X(J c))) is consis- tent, that is lim k→∞ ∆(mJ c, A(J c)T ⋆k,J c(x(J c))) = 0.

Consider a theoretical tree  T ⋆k built on the distribution (X, Y ). For all  k, let Ak(x) bethe cell in  T ⋆k containing x. For all k, we let

(i) Aφ(k)(x) be the subsequence of  Ak(x) composed of cells split by a coordinate in J , and A(J )φ(k)(x) the projection of  Aφ(k)(x) over J ,

(ii) Aψ(k)(x) be the subsequence of  Ak(x) composed of cells split by a coordinate in J c, and A(J c)ψ(k)(x) the projection of  Aψ(k)(x) over J c.

According to Lemma 1, the restrictions of the function  mJ over A ∈ {A(J )φ(k)+1(x),. . . , A(J )φ(k+1)(x)}are equal. Besides, for any  j ∈ J and s, the splitting criteria L⋆A(j, s) for A ∈ {A(J )φ(k)+1(x), . . . , A(J )φ(k+1)(x)}are the same. Therefore, the sequence A(J )φ(k)(x) is a sequence of cells resulting from a theoretical tree built on the distribution (X(J ), mJ (X(J ))). Similarly, the sequence  A(J c)ψ(k)(x) is a sequence of cells resulting from a theoretical tree built on the distribution (X(J c), mJ c(X(J c))).

Let  T ⋆J be any theoretical tree whose sequence of cells satisfies  AT ⋆k,J (x(J )) =A(J )φ(k)(x). Similarly, let  T ⋆J cbe any theoretical tree whose sequence of cells satisfies AT ⋆k,J c(x(J c)) = A(J c)ψ(k)(x). Then,

image

By assumption, and assuming that lim∞ φ, ψ = ∞, we have

image

Now, consider the case where lim∞ φ, ψ ̸= ∞. Since φ(N) ∪ ψ(N) = N, one can assume, without loss of generality, that lim∞ φ = ∞ and lim∞ ψ = p < ∞. Thus, there exists K such that for all k > K, splits in  AT ⋆k (x) are performed along coordinates in J only. Thus, by assumption, the corresponding theoretical tree  T ⋆k,J is consistent which implies that the splitting criterion  LAk(x)(j, s) tends to zero, for all  j ∈ J. This implies that the criterion  LAk(x)(j, s), for all j ∈ J c is equal to zero (otherwise, it would be larger than the criterion  LAk(x)(j, s) for j ∈ Jand consequently, splits would be performed along  j ∈ J c). This implies that the theoretical tree  T ⋆K,J cis fully grown, and by assumption, we have

image

which concludes the proof, by equality (24).

image

Proof of Theorem 2.

First statement. For any theoretical CART tree  T ⋆, we let, for all  k ∈ N,

image

and

image

We define similarly  uk,J c and fk,J c(x). By definition of the regression model, for all x  ∈ [0, 1]d,

image

where  fk(x) = V[m(X)|X ∈ Ak(x)]. By assumption this quantity tends to zero for almost all X  ∈ [0, 1]d. Besides, it satisfies  fk,J (x) ≤ 4∥m∥2∞. Therefore, according to the dominated convergence theorem, limk→∞ E[fk,J (X′)] = 0. Additionally,

image

Consequently,

image

The same result holds for  uk,J c. To conclude the proof, we just need to show that uk,J − uk+1,Jcorresponds to the variance reduction associated with variables X(J )between levels k and k + 1. Recalling that

image

the variance reduction between levels k and k + 1 is equal to

image

where

image

Each cell at level k +1 has been created (i) either by splitting a cell at level k into two cells, (ii) or by copying a cell at level k, if the cell is not split. In that latter case, the contribution of this cell to  uk,J − uk+1,Jis zero. Thus, we can rewrite equation (26) as

image

=

image

+

image

where  Aℓ,k+1,L and Aℓ,k+1,Rare the children of  Aℓ,k, the first sum (resp. the second) corresponds to the cell split along a coordinate in  J (resp. in J c), and nJ ,k is thenumber of split performed along a coordinate in J between level k and k + 1.

Note that, according to equation (21), all terms in the second sum in equation (27) are null. Finally, according to equation (22),  uk,J − uk+1,Jis exactly the sum of variance reduction for splits along coordinates in J that occur between level k and k + 1. Consequently, for all K, the population MDI computed with the theoretical tree  T ⋆K,

image

Hence, according to equation (25), for all theoretical tree  T ⋆

image

and similarly,

image

This proves the first statement of Theorem 2.

Second statement. To prove the second statement, we need to prove that the convergence of the population MDI in equations (28) and (29) holds uniformly over the set of all possible theoretical trees. Then, we need to extend the convergence to the empirical MDI computed with empirical CART trees. This can be done following the same reasoning as in the proof of Theorem 2.

image

Proof of Theorem 3. Theorem 3 is a direct application of Theorem 2, by noting that

image

in Theorem 2 results from Lemma 1 in Scornet et al. [2015].

To prove Theorem 4, we need the following Technical Lemma, whose proof is postponed to the end of this section.

Technical Lemma 1. Assume that

image

where  εis a centred noise of finite variance, independent of X. Thus, the theoretical splitting criterion in the cell  A ⊂ [0, 1]d satisfies

L⋆A(ℓ, s) = −E2[m(X)|X ∈ A] + P[X ∈ AL|X ∈ A]E2[m(X)|X ∈ AL]+  P[X ∈ AR|X ∈ A]E2[m(X)|X ∈ AR],

where AL = A ∩ {X(ℓ) < s} and AR = A ∩ {X(ℓ) ≥ s}.

Proof of Theorem 4. Let  A = �dℓ=1[aℓ, bℓ] ⊂ [0,1] be a generic cell. According to Lemma 1, the splitting criterion writes

image

For any j, the function  s �→ L⋆A(j, s) reaches its maximum at  s⋆ = (aj + bj)/2. Forthis particular value, we have

image

Thus the coordinate chosen for splitting in cell A must satisfy

image

Since the chosen coordinate satisfies equation (30) and the range of the cell along the chosen side is halved after the split, the diameter of any cell tends to zero, which proves that the theoretical tree is consistent. Hence Theorem 2 can be applied and leads to

image

since  X(j) ∼ U([0,1]). The convergence of the empirical MDI directly results from Theorem 2.

Proof of Theorem 5. Let  A = �dℓ=1[aℓ, bℓ] ⊂ [0,1] be a generic cell. According to Technical Lemma 1, the splitting criterion writes

image

For any  ℓ, the function  s �→ L⋆A(ℓ, s) reaches its maximum at  s⋆ = (aj +bj)/2. For this particular value, we have

image

The coordinate  ℓ⋆ is chosen for splitting if  L⋆A(ℓ⋆, s⋆) = max1≤k≤d L⋆A(k, s⋆), that is, for all k,

image

The variance reduction induced by the split (ℓ⋆, s⋆) in the cell  A = �dj=1[aj, bj] is equalto

image

Now, fix x  ∈ (0, 1]d. We want to prove that diam(Ak(x)) tends to zero as as k tends to infinity. For any cell  A, and all ℓ ∈ {1, . . . , d}, according to equation (31), we have

image

For all  k, let Ak(x) = �dj=1[aj,k, bj,k] be the cell containing x in the theoretical tree  T ⋆k .Let  u(j)k = bj,k −aj,k. Since, for all  j, (u(j)) is a positive decreasing sequence, we obtain that (u(j)) converges toward  γj ≥ 0. Let Γ = {j, γj > 0}and assume that Γ  ̸= ∅. Forall j,

image

Hence, there exists K such that for all  k > K, and all j ∈ J ,

image

In other words, after level k, all splits are performed along coordinates belonging to J c. Now, for all  j ∈ J c, limk→∞ u(j)k = 0, and thus, according to equation (33),

image

Therefore, there exists  K1 > K, such that, for all  k > K1,

image

Using again Equation (33), we deduce that, for all  k > K1,

image

thus the next split is performed along  j ∈ Jwhich is absurd. Hence, either  J = ∅ orJ c = ∅. If J c = ∅, given the expression of the variance reduction in equation (32), an additional split must occur after rank K, with an additional variance decreasing, which is forbidden by equation (34). Finally,  J = ∅and consequently, for all  j ∈ {1, . . . , d},lim k→∞ u(j)k = 0, which implies, for all x ∈ (0, 1]d,

image

Since X is uniformly distributed over [0, 1]d, and mis continuous, we have, almost surely,

image

A direct application of Theorem 1 yields

image

A direct calculation shows that  E[m(X)] = α, and E[m(X)2] = α2 (4/3)d, leadingdirectly to the conclusion. The last statement is a direct application of Theorem

image

Proof of Lemma 2. We prove the result in dimension two. The generalization in any dimension will directly follow. According to Theorem 5, in the root cell [0, 1]2, thevariable  ℓ⋆ selected for splitting must satisfy

image

where the right-hand term is equal to 1/2 for k = 1, 2. Therefore, the selected variable can be  X(1) or X(2). The corresponding decrease in variance is equal to  α2/4.

First theoretical tree. Consider the theoretical tree  T ⋆2,1 whose construction is stopped at level two, where the first split is performed along  X(1). The same calculation as above shows that splits must occur along the second variable which gives a variance decreasing of  α2/16 in the left cell, and 9α2/16 in the right cell. The corresponding partition is shown on the left of Figure 2. The variable importances for the tree  T ⋆2,1are

image

and

image

Second theoretical tree. Consider the theoretical tree  T ⋆2,2 whose construction is stopped at level two, where the first split is made along the second variable. Then the same reasoning as above shows that the second split must be made along the second variable, the resulting partition is displayed on the right of Figure 2. The variable importances for the tree  T ⋆2,2 are thus

image

and

image

Conclusion The partitions of the two theoretical trees  T ⋆2,1 and T ⋆2,2 defined above are the same but variable importances are not. Since the partitions are the same, one can choose any splitting strategy for the following splits and use this exact same strategy in both  T ⋆2,1 and T ⋆2,2. Since the splits occurring at level  k ≥2 are set to be the same, we have

image

image

Proof of Lemma 3. We have X  ∼ U

image

with

image

which leads to

image

Since we know that

image

the covariance can be expressed as

image

Since the marginals are uniform on [0, 1], we have  V[X(1)] = V[X(2)] = 1/12, whichyields,

image

Proof of Proposition 4. The first statement of Theorem 2 remains valid even when the density of X is not lower bounded by some positive constant. Besides, using Technical Lemma 1, the splitting criterion for Model 6, computed for a split along the first coordinate writes

image

Let us fix a split at position s along variable  X(1) and let ℓ = ⌊2βs⌋. The conditional density of  X(2) given that (X(1), X(2)) ∈ [0, s] × [0, 1] is

image

where, in order for  fX2|X1∈[0,s]to be a density, C must satisfy

image

that is  C = 2β/s. Hence,

image

and similarly,

image

To make explicit the splitting criterion in equation (35), we need to compute, on one hand,

image

and on the other hand,

image

Injecting equations (36) and (37) into (35), we get

image

The second term in equation (38) can be expressed as

image

and third term in equation (38) as

image

which, after several simplifications, leads to

image

Thus we have

image

with

image

We know that the function  s �→ L⋆(1, s) is symmetric around 1/2 and we conjecture that it is increasing on [0, 1/2]. Note that, we have  L⋆(1, 1/2) = 1/4. For all β, letmβ = L⋆(1, 1/2 − 1/2β). Setting a regular grid of size

image

image

Figure 4: Illustration of the splitting criterion  L⋆(1, s) for β = 0 (top left), β = 1 (topright),  β = 2 (bottom left), β = 3 (bottom right).

if all evaluations of  L⋆(1, s) on the previous grid are lower than  mβthen the maximum of  s �→ L⋆(1, s) is reached in [1/2 − 1/2β, 1/2 + 1/2β]. An numerical optimization with the function optimize of R [see Husmann et al., 2017, R Core Team, 2013] shows that it is true for  β = 1, . . . , 5. Figure 4displays the splitting criterion for  β = 0, . . . , 3. On

the interval [1/2 − 1/2β, 1/2), the partial derivative writes

image

Using equation (39) and the fact that ∂L⋆(1,1)∂s = 0, we have

image

where the discriminant of the second order expression in equation (40) is, for  β ≥ 1,

image

Thus the sign of the expression in the right-hand side of equation (40) does not vary on [0, 1/2) and is positive in zero. Thus, the splitting criterion is increasing on [1/2 −1/2β, 1/2).Since we know with the optimization procedure that the maximum is reached in this interval, it is reached for s = 1/2. This proves the first statement of Proposition 4.

Regarding the second statement of Proposition 4, note that the above calculations hold also for a split along the second variable. According to Theorem 4, the variance reduction associated to a split performed along  X(3) equals α2/16. Thus, for  β =0, . . . , 5, a split will occur along the first or second variable if, and only if

image

that is  |α| < 2.

image

Proof of Theorem 6. The distribution of  X(3) in each cell is still uniform. Assume that  |α| <2, so that the first split is performed on (without loss of generality) the first variable. Then the resulting left cell is  A = [0, 1/2]×[0, 1]2. For β >1, the conditional distribution of (X(1), X(2)) on Ais given by

image

On this cell, the regression model still writes

image

where

image

Therefore, the previous reasoning applies and a split is performed on the first or second variable in this cell if, and only if,

image

that is  |α| <1. This holds until each cell has been cut  βtimes along the first or second coordinate. In the resulting cells, the distribution of X is uniform on its support which is an hyperrectangle, and, according to Theorem 4, the variance of the regression function decreases to zero as a function of the tree level k. The previous reasoning shows that the sequence of splits along  {X(1) or X(2)} and X(3) is deterministic and depends only of the coefficient  α. The fourth statement, that is the convergence of the empirical MDI directly results from the convergence of theoretical trees proved above, and from a direct application of Theorem 2.

image

Proof of Lemma 4. According to Theorem 2, splits along  X(3) has no impact on the variance of  X(1) + X(2). One can thus assume, without loss of generality that the splits of the first  βlevels are all performed along the first and the second variables. According to the proof of Theorem 4, in each cell of the first  βlevels, the splitting criterion is the same along variable  X(1) and X(2). Consider the tree  T ⋆1 where allsplits in the first  βlevels are made along the first variable. Then, the total decrease of variance associated to all these splits is

image

In each cell that results from the first  βcuts, the distribution of (X(1), X(2)) is uniform. Then the decrease of variance in all these cells is the same for  X(1) and X(2). Hence,

image

Let  T ⋆2 be the exact same tree as  T ⋆1 but where all the first  βsplits are performed along the second variable. As before, we have,

image

According to Theorem 4, we know that

image

Finally, we have

image

Proof of Technical Lemma 1. With the notations of Technical Lemma 1, we have

image

since the noise  εis independent of X and the probabilities  P[X ∈ AL | X ∈ A] andP[X ∈ AR | X ∈ A] sum to one. Therefore,

image

where

image

which gives the conclusion by injecting (42) into (41).

S. Arlot and R. Genuer. Analysis of purely random forests bias. arXiv:1407.3939, 2014.

C. B´enard, S. Da Veiga, and E. Scornet. MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA. arXiv preprint arXiv:2102.13347, 2021.

G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063–1095, 2012.

A.-L. Boulesteix, A. Bender, J. Lorenzo Bermejo, and C. Strobl. Random forest gini importance favours snps with large minor allele frequency: impact, sources and recommendations. Briefings in Bioinformatics, 13:292–304, 2011.

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

L. Breiman. Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA, 1:58, 2002.

L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, 1984.

M. Fern´andez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.

R. Genuer, J.-M. Poggi, and C. Tuleau. Random forests: some methodological insights. arXiv preprint arXiv:0811.3619, 2008.

R. Genuer, J.-M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31:2225–2236, 2010.

B. Gregorutti, B. Michel, and P. Saint-Pierre. Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis, 90:15–35, 2015.

B. Gregorutti, B. Michel, and P. Saint-Pierre. Correlation and variable importance in random forests. Statistics and Computing, 27(3):659–678, 2017.

U. Gr¨omping. Variable importance in regression models. Wiley Interdisciplinary Reviews: Computational Statistics, 7:137–152, 2015.

T.J. Hastie and R.J. Tibshirani. Generalized additive models. Routledge, 2017.

K. Husmann, A. Lange, and E. Spiegel. The R Package optimization: Flexible Global Optimization with Simulated-Annealing, 2017.

H. Ishwaran. Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537, 2007.

H. Ishwaran. The effect of splitting on random forests. Machine Learning, pages 1–44, 2013.

J.M. Klusowski. Analyzing cart. arXiv preprint arXiv:1906.10086, 2019.

J.M. Klusowski. Sparse learning with CART. arXiv preprint arXiv:2006.04266, 2020.

X. Li, Y. Wang, S. Basu, K. Kumbier, and B. Yu. A debiased mdi feature importance measure for random forests. In Advances in Neural Information Processing Systems, pages 8049–8059, 2019.

A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3): 18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.

W.-Y. Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14–23, 2011.

G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable impor- tances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439, 2013.

L.L. Nathans, F.L. Oswald, and K. Nimon. Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research & Evaluation, 17 (9), 2012.

K.K. Nicodemus. Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures. Briefings in bioinformatics, 12 (4):369–373, 2011.

K.K. Nicodemus and J. D. Malley. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics, 25(15):1884–1890, 2009.

A. Poterie, J.-F. Dupuy, V. Monbet, and L. Rouviere. Classification tree algorithm for grouped variables. Computational Statistics, 34(4):1613–1648, 2019.

Y. Qi. Ensemble Machine Learning, chapter Random forest for bioinformatics, pages 307–323. Springer, 2012.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www. R-project.org/.

B. Ramosaj and M. Pauly. Asymptotic unbiasedness of the permutation importance measure in random forest models. arXiv preprint arXiv:1912.03306, 2019.

A.C. Rencher and G.B. Schaalje. Linear models in statistics. John Wiley & Sons, 2008.

G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. H. Torr. Randomized trees for human pose detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43:1716–1741, 2015.

S.R. Searle and M.H.J. Gruber. Linear models. John Wiley & Sons, 2016.

C.J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, pages 689–705, 1985.

C. Strobl and A. Zeileis. Danger: High power!–exploring the statistical properties of a test for random forest variable importance. Technical report, University of Munich, Department of Statistics, 2008.

C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8 (1):25, 2007.

C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9:307, 2008.

C. Strobl, T. Hothorn, and A. Zeileis. Party on! A new, conditional variable impor- tance measure for random forests available in the party package. Technical report, University of Munich, Department of Statistics, 2009.

A. Sutera, G. Louppe, V.A. Huynh-Thu, L. Wehenkel, and P. Geurts. Contextdependent feature analysis with random forests. Uncertainty In Artificial Intelligence: Proceedings of the Thirty-Second Conference, 2016.

S. Wager and G. Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.

M.N. Wright, A. Ziegler, and I.R. K¨onig. Do little interactions get lost in dark random forests? BMC bioinformatics, 17(1):1–10, 2016.

Z. Zhou and G. Hooker. Unbiased measurement of feature importance in tree-based methods. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2): 1–21, 2021.

R. Zhu, D. Zeng, and M.R. Kosorok. Reinforcement learning trees. Technical Report, University of North Carolina, Chapel Hill, 2012.


Designed for Accessibility and to further Open Science