A Characterization of Mean Squared Error for Estimator with Bagging

2019·Arxiv

Abstract

Abstract

Bagging can significantly improve the generalization performance of unstable machine learning algorithms such as trees or neural networks. Though bagging is now widely used in practice and many empirical studies have explored its behavior, we still know little about the theoretical properties of bagged predictions. In this paper, we theoretically investigate how the bagging method can reduce the Mean Squared Error (MSE) when applied on a statistical estimator. First, we prove that for any estimator, increasing the number of bagged estimators N in the average can only reduce the MSE. This intuitive result, observed empirically and discussed in the literature, has not yet been rigorously proved. Second, we focus on the standard estimator of variance called unbiased sample variance and we develop an exact analytical expression of the MSE for this estimator with bagging. This allows us to rigorously discuss the number of iterations N and the batch size m of the bagging method. From this expression, we state that only if the kurtosis of the distribution is greater than MSE of the variance estimator can be reduced with bagging. This result is important because it demonstrates that for distribution with low kurtosis, bagging can only deteriorate the performance of a statistical prediction. Finally, we propose a novel general-purpose algorithm to estimate with high precision the variance of a sample.

1 Introduction

Since the popular paper [1], bootstrap aggregating (or bagging) has become prevalent in machine learning applications. This method considers a number N of samples from the dataset, drawn uniformly with replacement (bootstrapping) and averages the different estimations computed on the different sub-sets in order to obtain a better (bagged) estimator. In this article, we theoretically investigate the ability of bagging to reduce the Mean Squared Error (MSE) of a statistical estimator with an additional focus on the MSE of the variance estimator with bagging.

In a machine learning context, bagging is an ensemble method that is effective at reducing the test error of predictors. Several convincing empirical results highlight this positive effect [2, 3]. However, those observations cannot be generalized to every algorithm and dataset since bagging deteriorates the prediction performance in some cases [4]. Unfortunately, it is difficult to understand the reasons explaining why the behavior of bagging differs from one application to another. In fact, only a few articles give theoretical interpretations [5, 6, 7]. The difficulty to obtain theoretical results is partially due to the huge number of bootstrap possibilities for the estimator with bagging. For instance, in [8, 9, 10], authors studied properties of bagging-statistic, which considers all possible bootstrap samples in bagging, and provided a theoretical understanding regarding the relationship between the sample size and the batch size. However, a bagging estimator differs from a bagging statistic and as explain in [8], due to the huge number of sampling if we want to use bagging statistic, in practice bagging estimators are used. These insufficient theoretical guidelines generally lead to arbitrary choices of the fundamental parameters of bagging, such as the batch size m and the number of iterations N.

In this paper, we investigate the theoretical properties of bagging applied to statistical estimators and specifically the impact of bagging on the MSE of those estimators. We provide three core contributions:

• We give a rigorous mathematical demonstration (proof) that the bias of any estimator with bagging is independent from the number of iterations N, while the variance linearly decreases in . This intuitive behavior, observed in several papers [1, 11], has not yet been demonstrated. We also provide recommendations for the choice of the number of iterations N and the batch size m. We further discuss the implication in a machine learning context. To demonstrate these points, we develop a mathematical framework which enables us, with a symmetric argument, to obtain an exact analytical expression for the bias and variance of any estimator with bagging. This symmetric technique can be used to calculate many other metrics on estimators, we hope that our findings will enable more research in the area.

• We use our framework to further study bagging applied to the standard estimator of variance called unbiased sample variance. This estimator is widely used and a lot of work has been done on specific versions, such as the Jackknife variance estimator [12] or the variance estimator with replacement [13]. In this paper, we derive an exact analytical formula for the MSE of the variance estimator with bagging. It allows us to provide a simple criteria based on the kurtosis of the sample distribution, which characterizes whether or not the bagging will have a positive impact on the MSE of the variance estimator. We find that on average, applying bagging to the variance estimator reduces the MSE if and only if the kurtosis of

the sample distribution is above and the number of bagging iterations N is large enough. From this theorem, we are able to propose a novel, more accurate variance estimation algorithm. This result is particularly important because it describes explicit configurations where the application of bagging cannot be beneficial. We further discuss how this result fits into common intuitions in machine learning.

• Finally, we provide various experiments illustrating and supporting our theoretical results.

The rest of the paper is organized as follows. In Section 2 we present the mathematical framework along with our derivation of the MSE for any bagged estimators. In Section 3, we focus on the particular case of the bagged unbiased sample variance estimator and we provide a new criteria on the variable kurtosis. In Section 4, we support our theoretical findings with three experiments.

2 Bagged Estimators: The More The Better

In this section we rigorously define the notations before presenting the derivation of the different components of the MSE of a bagged statistical estimator, namely the bias and the variance. We subsequently deduce our first contribution: the more iterations N used in the bagging algorithm, the smaller the MSE is, with a linear dependency.

2.1 Notations, Definitions

A dataset, denoted , is the realization of an independent and identically distributed sample set, noted , of a variable A statistic of the variable X is noted . An estimator of , computed from

By considering , a random function, we denote a uniform sampling with replacement from L of size m. The random variable U is defined on , the finite set of all m sized sampling with replacement, of cardinal . The bagging method considers N such sampling functions taken uniformly from can now define the bagging estimator:

Usually, the size of the sample sets is n. Here, we consider the general case where is a function of a set of any size m. The number N of iterations is often set to represents all the natural numbers between a and b) without rigorous justification [1, 14, 15]. We discuss this parameter in the following section.

The average with respect to L is noted , and with respect to B is noted . We assume that . Since B is defined on a finite set, we define which is the expected value taken with respect to the pair of random variables (L, B) and we have

Therefore, there are possibilities to draw N sampling functions taken uniformly from ) and at L fixed, the average over B takes the form:

2.2 MSE of Estimators with Bagging

In this subsection we state the theorem on the dependence of the bagged estimators in terms of the number of iterations N. The central idea of the proof is to count all the possible bagged estimators and to use a symmetric argument. The proof of Theorem 2.1 can be found in the supplementary material. We eventually adapt this framework to the case of regression predictors.

Theorem 2.1. For a general statistic , there exists two positive terms F and G, independent of N, such that the MSE of the bagged estimator, defined by (1), satisfies:

with

are positive and do not depend on N. We deduce that:

2.3 Bagging for Regression

In the regression setup, the variable X considered is a pair (input, label), X = (Y, T) from . For any input , the statistic considered is . The prediction is noted and the MSE of this estimator represents the prediction error that needs to be reduced. The bagging version of the estimator is noted

The MSE of the bagged predictor is

According to Theorem 2.1, for every y, there exists two positive values

independent of N, such that:

Therefore,

We deduce that on average, increasing the number of iterations N can only reduce the average of the MSE of the bagged estimator in a regression setting.

In this section, we demonstrated that the MSE of an estimator with bagging is a linear function of (). The multiplicative component of the dependency, F, is positive. Thus, increasing the number of iterations N can only improve the accuracy of the estimator. This fact has been observed in several empirical studies [1, 16], but to our knowledge has never been rigorously proved. G represents a lower bound on the MSE of the estimator with bagging. This means that a sufficient N ensures that the term is negligible compared to G. Moreover, if G is bigger than the MSE of the estimator without bagging, then bagging only deteriorates the precision of the estimator. This deterioration was observed empirically, but the reasons had not yet been theoretically explained [4, 17]. This result holds for machine learning algorithms in the regression setup as well. In the following section, we apply this general framework to the specific case of the unbiased sample variance estimator.

3 Sample Variance Estimator

In this section, we study the specific case of the bagged unbiased sample variance estimator. Applying the above framework on this particular case, we are able to deduce precise constraints under which using bagging improves on average the MSE of the unbiased sample variance estimator. We derive from this result, given in Theorem 3.3, a criteria expressed in terms of the kurtosis of the sample distribution, the batch size m and the number of iterations N. We eventually propose at the end of the section, a novel algorithm which provides a more accurate variance estimation if those criteria are met. Carrying the notation of the precedent section, we assume without loss of generality that E(X) = 0 by taking . We denote the second moment of the centered random variable and the forth moment. Since X is centered, is the variance of is the kurtosis of X. In the rest of the section variance of X and we note . The version of this estimator with bagging follows definition (1) and is noted

3.1 Bias and Variance of the Sample Variance Estimator with Bagging

The following theorem gives an exact expression of the bias and variance of the variance estimator with bagging.

Theorem 3.1. Let X be a random variable with finite fourth moment and satisfying E(X) = 0. The bagged variance estimator with batch size m and number of bagging iteration N satisfies:

where

and

Moreover, the MSE of the variance estimator with bagging is:

The proof of Theorem 3.1 can be found in the supplementary material.

3.2 Asymptotic Analysis and Comparison of Estimators

In this section, we analyze the asymptotic behavior of the bagged estimator with respect to the parameters m and N. We also compare the bagged estimator and the non-bagged estimator. Recall that for the standard variance estimator, the MSE is known(see [18]).

Proposition 3.2. Assuming that X has a finite fourth moment, the variance of the standard sample variance estimator is given by:

Since this estimator is unbiased, it holds:

Proposition (3.2) combined with Theorem 3.1 enables us to compare the MSE of and we find that:

We state the following theorem.

, bagging reduces on average the MSE of the variance estimator if and only if:

and

Therefore, bagging should be used when (3) and (4) are satisfied. Thus N and m should be carefully chosen. Moreover, the gain obtain by using bagging is in . The equation (3) can be rewritten as the kurtosis of X. Remark that if

We deduce that the number of iterations N should be at least n/2 in this case. As previously mentioned, the number of iterations N of bagging is often chosen as without theoretical justifications [19]. Theorem 3.3 gives, for the sample variance estimator, a minimum number of iterations N above which bagging improves the estimation.

Note that the condition on the kurtosis () is not restrictive. In fact, many classical continuous distributions satisfy this condition. For example, for a normal distribution, for a logistic distribution, for a uniform distribution, for an exponential distribution. We now propose a novel algorithm to estimate the variance of a sample.

3.3 Algorithm for Higher Precision Variance Estimation

We deduce from the preceding theoretical results a simple algorithm to estimate the variance of a sample with higher precision. To simplify the procedure, we choose to take the batch size m equal to the full sample size n. Let q, an integer greater or equal to 1, denote a parameter used to control the quality of the resulting estimation. The higher the q, the higher N; and as a result of the Theorem 2.1 and Theorem 3.3, the better the estimation is (at the cost of an increasing computation time). We suggest taking is the floor function. The reader can find an analysis of the algorithmic complexity in the supplementary material.

4 Experiments

In this section we present various experiments illustrating and supporting the previous theoretical results. The section begins with experiments on general estimator, with a focus on regression estimators as presented in section 2.3. Then, we present empirical experiments about the bagging version of the sample variance estimator, supporting our theoretical results of section 3. For the experiments, we used Python and specifically the Scikit-Learn [20], Numpy [21], Scipy [22] and Numba [23] libraries. More details on the experimental setup can be found in the supplementary material.

4.1 Variance of Bagged Estimators in Regression

In order to empirically validate Theorem 2.1 we performed experiments on two regression predictors: the linear regression and the decision tree regression. In each case, we measured the MSE on the test set after training each model on bagged samples, varying the parameter N of bagging iterations. We generated toy regression datasets using the Scikit-Learn library. Each dataset, has a specific amount of Gaussian noise () applied to the output (0.5 or 5). The number of samples (1000) and the size of the feature space dimension (2) remained constant. To obtain the bagged and non-bagged predictors, we trained the regressors on 5% of the data and tested it on the remaining 95%. This partition was chosen to enable a fast training of the growing number of bagged predictors. We are interested in showing empirically that our theoretical relation holds, not to achieve good absolute results. Using the notations from 2.3: the average taken over the test set of size was taken over 100 trials and was taken over 10 trials. As expected, we observe that the MSE of those estimators decreases at a rate of . We fit a non-linear function on the resulting datapoints of the form to better observe this relationship. More details on the experiment setup and hyperparameters used can be found in the supplementary material.

Figure 1: Linear Regression (Top-Left and Bottom-Left) and Regression Tree (Top-Right and Bottom-Right) Predictors. We observe the convergence rate that we demonstrated analytically. We also observe that for the Linear Regression, considered as a stable predictor, bagging does not help. On the other hand, for the Decision Tree Regression, considered as an unstable predictor (following the definition of [1], predictors are unstable if a small change in the training set can result in large changes in predictions), bagging helps.

4.2 MSE of Bagged and Non-Bagged Unbiased Variance Estimator

In this experiment, we test the empirical validity of Theorem 3.3. Our condition on the kurtosis comes from the following equality that we proved in the supplementary material: Here we set m = n. We measured the distance between the MSE of a bagged and non-bagged unbiased sample variance estimator for three classical probability distributions: a standard Gaussian distribution, a uniform distribution on and a Rademacher distribution. We fixed the number of iterations N to 50, only varying the sample size n, and averaged the estimated variance over 10000 trials. As expected, it follows the same shape as our condition divided by

4.2.1 Kurtosis Condition

In this experiment, we tested the kurtosis condition . Distributions with kurtosis lower than are unusual as previously mentioned. We designed a distribution whose kurtosis can vary above and below when the parameter p is changing. Let X be a random variable following this distribution. With

The kurtosis of this distribution is thus:

Varying the parameter p between 0 and 1, the kurtosis varies from Figure 3 shows the MSE with and without bagging accordingly. We fixed N to , and we averaged over 100000 trials. As expected, The MSE of the estimator with bagging becomes better than the MSE of the estimator without bagging when the threshold is passed.

5 Conclusion

In this paper, we theoretically investigate the MSE of estimators with bagging. We prove the existence of a linear dependency of the MSE on , N being the number of bagged averaged estimators. This dependency enables us to provide guidelines for setting the parameter N. We also explain why bagging is detrimental in some cases. We use the mathematical framework developed Section 2 to describe the MSE for the sample variance estimator with bagging. It appears that the bagging can reduce the MSE of this estimator if and only if the kurtosis of the distribution is above . This condition holds for a large

Figure 2: Usual Distribution Experiments

number of classical probability distributions. Using this condition, we eventually propose a novel algorithm for more accurate variance estimation.

Figure 3: The MSE of the variance estimator of a distribution with or without bagging with respect to the kurtosis. As expected, we observe that the MSE of the bagged variance estimator becomes lower than the MSE of the non-bagged variance estimator after

References

[1] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[2] Geoffrey I Webb and Zijian Zheng. Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques. IEEE Transactions on Knowledge and Data Engineering, 16(8):980–991, 2004.

[3] Jianchang Mao. A case study on bagging, boosting and basic ensembles of neural networks for ocr. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227), volume 3, pages 1828–1833. IEEE, 1998.

[4] Yves Grandvalet. Bagging equalizes influence. Machine Learning, 55(3):251– 270, 2004.

[5] Peter Bühlmann, Bin Yu, et al. Analyzing bagging. The Annals of Statistics, 30(4):927–961, 2002.

[6] Jerome H Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of statistical planning and inference, 137(3):669–683, 2007.

[7] Andreas Buja and Werner Stuetzle. Smoothing effects of bagging: Von mises expansions of bagged statistical functionals. arXiv preprint arXiv:1612.02528, 2016.

[8] Andreas Buja and Werner Stuetzle. Observations on bagging. Statistica Sinica, 16(2):323, 2006.

[9] Songxi Chen and Peter Hall. Effects of bagging and bias correction on estimators defined by estimating equations. Statistica Sinica, 13:97–109, 2003.

[10] Andreas Buja and Werner Stuetzle. Smoothing effects of bagging: Von mises expansions of bagged statistical functionals. arXiv preprint arXiv:1812.08808, 2018.

[11] Luoluo Liu, Trac D Tran, et al. Reducing sampling ratios and increasing number of estimates improve bagging in sparse regression. arXiv preprint arXiv:1812.08808, 2018.

[12] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics, pages 586–596, 1981.

[13] Eungchun Cho and Moon Jung Cho. Variance of sample variance with replacement. International Journal of Pure and Applied Mathematics, 52(1):43–47, 2009.

[14] Thomas G Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine learning, 40(2):139–157, 2000.

[15] Aurélie Lemmens and Christophe Croux. Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43(2):276–286, 2006.

[16] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. In European Conference on Machine Learning, pages 323–334. Springer, 2007.

[17] Marina Skurichina and Robert PW Duin. Bagging for linear classifiers. Pattern Recognition, 31(7):909–930, 1998.

[18] C. Rose and M.D. Smith. Mathematical Statistics with Mathematica. Springer-Verlag, 2002.

[19] Alireza Fazelpour, Taghi M Khoshgoftaar, David J Dittman, and Amri Naplitano. Investigating the variation of ensemble size on bagging-based classifier performance in imbalanced bioinformatics datasets. In 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), pages 377–383. IEEE, 2016.

[20] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.

[21] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.

[22] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed <today>].

[23] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, page 7. ACM, 2015.

Supplementary Material: A Characterization of Mean Squared Error for Estimator with Bagging

1 Proof of Theorem 2.1

Recall that the estimator with bagging is given by:

where is a set of N sampling with replacement of the dataset , taken independently and uniformly from . Since we assumed that in a finite set, is well defined and the Fubini’s theorem is always available for interchange between different expectations.

Lemma 1.1. Let U be a random variable uniformly distributed over U. Then the expected value of the bagged estimator does not depend on N and is equal to

Proof. For N independent identically distributed bagged samples represented by , it follows by linearity of the expectation operator and Fubini’s theorem that:

Now we turn to the variance. Previous lemma 1.1 computes the expectations, therefore we only need to compute the second moment.

Lemma 1.2. The expected value of the bagged estimator satisfies:

where U is a random variable uniformly distributed over U and

and

Since is uniformly distributed on is a symmetric polynomial with respect to . Thus there exists two constants and such that for any function

Let

Now define , such that right hand side of equation (1) equals to and we need to compute the value of left hand side. Observe that

where is uniformly and independently distribute on U. Thus is a family of independent Bernoulli with parameter . The second moment of is thus binomial distributed with parameter . Recall that the second moment of a binomial distribution with parameter (N, p) equals to . Therefore:

Simplifying equation (1) we have:

Together with (2), we deduce :

Hence,

Now we are in position to prove theorem:

Proof of theorem. : According to lemma 1.1 and 1.2, the variance of the bagging estimator is:

Hence, the MSE of

2 Proof of Theorem 3.1

We recall that can be rewritten as:

We begin with some lemmas.

Lemma 2.1. Let X be a random variable with finite forth moment satisfying E(X) = 0. Let be independent copies of X and . Denote i, j, k, l four distinct index of {1, ..., n}. Then the following statements hold.

Proof. The proof of the three first items is immediate by independence of

According to theorem 2.1 , we need to compute and . We begin with

Lemma 2.2. Let X be a random variable with finite forth moment satisfying E(X) = 0. Let be independent copies of X and . Let U be a random variable uniformly distributed over U. Then the following equations hold:

Proof. Item 1. Since U is uniformly distributed on U, it holds

Remark that is a symmetric polynomial of , there exists a constant A such that

A is thus the coefficient of in both side. For , denote and the number of and the number of in the sample of bagging . Then the coefficient of in . Therefore,

Consider the function

Since G can also be rewritten as

On the other hand, uniformly pick a U out of U is equivalent to uniformly pick the value of independently. Hence,

where u is uniformly distributed on . Therefore

Together with equation (8), it holds:

Plugin into equation (7), we have

Recall that the standard variance estimator is an unbias estimator and can be written as:

Item 2. By equation (10) and a simple development,

Combining with lemma 2.1, it holds:

Item 3. We remark again is a symmetric polynomial function of and the degree of this polynomial is 4. Thus there exists constants such that:

The coefficients are the coefficients of and respectively. We use similar arguments as in the proof of item 1. Denote the number of in the sample of bagging . Again according to property of G, then it holds:

and

Therefore, with equation (9) and (13), we have

Item 3 follows with equation (14) and lemma 2.1.

Now we are ready to prove theorem 3.1 .

Proof of theorem 3.1 . According to theorem 2.1 and lemma 2.2, it holds

On the other hand, applying lemma 2.2 successively, we have:

and:

The proof is completed.

3 Proof of theorem 3.3

According to theorem 2.1 and proposition 3.2, it holds

On the other hand, simplifying equation (1) from theorem 2.1, we have:

We deduce therefore

We deduce theorem 3.3 by comparing the latter equation with 0.

4 Algorithm Complexity

The algorithm complexity decomposes as follows: The estimation of the kurtosis takes O(n), The estimation of each inner variance takes O(n), thus the estimation of the bagged variance takes O(Nn). The total complexity is thus O(n) + . Given that the condition must hold, we deduce that the complexity is in . Compared to traditional variance estimation in O(n), the complexity is thus greater. As a result, this algorithm can only be useful in cases where the sample size n is not too large.

5 Experimental Setup Details

For all experiments, whenever possible we optimized and parallelized our computation using the Python Numbalibrary. All the experiments were run on a single CPU 3,1 GHz Intel Core i5 with 8 GB of RAM.

5.1 First Experiment

To generate the datasets, we used the make_regression function from sklearn. Regarding the first experiment, we used the Numpy library to estimate the param- eters of the linear regression (linalg.lstsq function with default hyperparamtersFor the Decision Tree regression, we use the scikit-learn implementationwith default hyperparameters. To fit the non-linear curve on the estimated data points, we used the optimize.curve_fit function from the Scipy package

5.2 Second Experiment

For the second experiment, we essentially used the random module from the Numpy library

5.3 Third Experiment

For the third experiment, we used the rv_discrete module from the Scipy library to design a custom discrete distribution

designed for accessibility and to further open science