BoostTree and BoostForest for Ensemble Learning

2020·Arxiv

Abstract

Abstract

Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It increases the randomness (diversity) by drawing the cut-points randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest generally outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 35 classification and regression datasets. Remarkably, BoostForest tunes its parameters by simply sampling them randomly from a parameter pool, which can be easily specified, and its ensemble learning framework can also be used to combine many other base learners.

1 INTRODUCTION

ENSEMBLE learning [1], [2] trains multiple base learnersto exploit the relationship between a set of covariates (features) and a response (label), and then combines them to produce a strong composite learner with better generalization performance. It has been successfully used in biology [3], [4], [5], engineering [6], [7], [8], healthcare [9], [10], etc.

For example, in biology, Wang et al. [3] used an ensemble of neural networks to emulate mechanism-based biological models. They found that the ensemble is more accurate than an individual neural network, and the consistency among the individual models can indicate the error in prediction. In Moon exploration, Yang et al. [7] used ensemble transfer learning and Chang’E data for automatic Moon surface impact crater detection and age estimation. They successfully identified 109,956 new craters from 7,895 training samples. In healthcare, Agius et al. [10] developed a Chronic Lymphocytic Leukemia (CLL) treatment-infection model, an ensemble of 28 machine learning algorithms, to identify patients at risk of infection or CLL treatment within two years of diagnosis.

One of the most popular algorithms for constructing the base learners is the decision tree [11], [12], [13]. Two common approaches for constructing the composite learner are bootstrap aggregating (Bagging) and boosting.

Bagging [14] connects multiple base learners in parallel to reduce the variance of the ensemble. Each base learner is trained using the same learning algorithm on a bootstrap replica, which draws N (the size of the original training set) samples with replacement from the original training set. The outputs of these base learners are then aggregated by majority voting (for classification) or averaging (for regression) to obtain the final output. To achieve robust performance, the base learners in an ensemble should be both accurate and diverse [15], [16], [17].

There are many approaches to increase the accuracy of base learners in an ensemble. Combining the advantages of tree models and linear models can greatly improve the model’s learning ability, which is the main idea of model trees. M5 [18] constructs a linear regression function at each leaf to approximate the target function for high fitting ability. When a new sample comes in, it is first sorted down to a leaf, then the linear model at that leaf is used to predict its output. M5P (aka M5) [19] trains a linear model at each node of a pruned tree to reduce the risk of over-fitting. More sophisticated regression algorithms, e.g., Ridge Regression (RR) [2], Extreme Learning Machine (ELM) [20], Support Vector Regression (SVR) [21], and Neural Network [22], have rarely been used in model trees. A possible reason is that they have some hyper-parameters to tune, e.g., the regularization coefficient and the number of nodes in the hidden layers. It is an NP-complete problem to simultaneously determine the structure of the model tree and the parameters of each node model [23]. A common approach is cross-validation, but it is very time-consuming. Thus, it is desirable to develop a strategy that can make the model tree more compatible with these more sophisticated (and hence potentially better performing) regression models.

There are also many approaches to increase the diversity of base learners in an ensemble, which can be divided into three categories [24]: 1) Sample-based strategies, which train each base learner on a different subset of samples, and thus are scalable to big data. For example, Bagging uses bootstrap sampling to obtain different subsets to train the base learners, and AdaBoost [25] uses adaptive sample weights (larger weights for harder samples) in generating a new base learner. 2) Feature-based strategies, which train each base learner on different subsets of features, and thus are scalable to high dimensionality. For example, each decision tree in Random Forest (RandomForest) [26], [27], [28] selects the feature to be split from a random subset of features, instead of all available features. Similarly, each decision tree in Extremely Randomized Trees (Extra-Trees) [29] splits nodes by drawing the cut-points completely randomly. 3) Parameter-based strategies. If the base learners are sensitive to their parameters, then setting different parameters can improve the diversity. For example, different hidden layer weights can be used to initialize diverse neural networks. Interestingly, these three categories of diversity enhancement strategies are complementary; so, they can be combined for better performance.

Boosting [25], [30], [31], the driving force of Gradient Boosting Machine (GBM) [11], can be used to reduce the bias of an ensemble. It is an incremental learning process, in which a new base learner is built to compensate the error of previously generated learners. Each new base learner is added to the ensemble in a forward stage-wise manner. As the boosting algorithm iterates, base learners generated at later iterations tend to focus on the harder samples. Mason et al. [32] described boosting from the viewpoint of gradient descent and regarded boosting as a stage-wise learning scheme to optimize different objective functions iteratively. FilterBoost [33] is a variant of AdaBoost [25], which fits an additive logistic regression model and minimizes the negative log likelihood step-by-step. DeepBoosting [34] can use very deep decision trees as base learners. Popular implementations of GBM, e.g., XGBoost [12] and LightGBM [35], have been successfully used in many applications [36], [37], [38], [39]. However, traditional boosting approaches [11], [12], [35] often have many parameters and thus require cross-validation, which is unreliable on small datasets, and time-consuming on big data. It is desirable to develop an algorithm that has very few parameters to tune and is robust to them.

This paper proposes BoostForest, which integrates boosting and Bagging for both classification and regression. Our main contributions are:

1) We propose a novel decision tree model, BoostTree, that integrates GBM into a single decision tree, as shown in Fig. 1(a). BoostTree uses the node function, e.g., RR, ELM, or SVR, to train a linear or nonlinear regression model at each node for regression or binary classification, or J regression models for J-class classification, where J is the number of classes. For a given input, BoostTree first sorts it down to a leaf, then computes the final prediction by summing up the outputs of all node models along the path from the root to that leaf. Similar to Extra-Trees, BoostTree increases the randomness (diversity) by drawing the cut-points randomly at node splitting.

2) We propose a novel parameter setting strategy, random parameter pool sampling, which makes Boost-

Fig. 1. (a) BoostTree with four leaves. For a given input, BoostTree first sorts it down to a leaf q(x), then computes the final prediction by summing up the outputs of all node models along the path (given by ) from the root to the leaf. The parameters of BoostTree are randomly selected from a parameter pool. (b) BoostForest with K BoostTrees. Bootstrap is used to obtain K replicas of the training set.

Tree easier to tune its hyper-parameters than traditional approaches. In this strategy, the parameters of BoostTree are not specific values, but random samples from candidate sets stored in a parameter pool. Each time a root node is generated, BoostTree randomly selects its hyper-parameters from the parameter pool.

3) Using BoostTrees as base learners, we propose a novel ensemble learning approach, BoostForest, as shown in Fig. 1(b). It first uses bootstrap to obtain multiple replicas of the original training set, and then trains a BoostTree on each replica. BoostForest uses the parameter pool sampling strategy to simplify its hyper-parameter tuning process, and outperforms several popular ensemble learning approaches. Moreover, it represents a very general ensemble learning framework, whose base learners can be any tree model, e.g., BoostTree, M5P, ExtraTree (a single decision tree in Extra-Trees), or Logistic Model Tree (LMT) [40], or even a mixture of different models.

The remainder of this paper is organized as follows: Section 2 introduces some related work. Section 3 describes the details of our proposed BoostTree and BoostForest. Section 4 presents experimental results to demonstrate the superiority of BoostForest over several popular ensemble learning approaches on 35 classification and regression datasets. Finally, Section 5 draws conclusions and points out some future research directions.

2 RELATED WORK

GBM [11] generates an ensemble via an iterative process. It uses Newton’s method to decompose the original classi-fication or regression problem into multiple sub-regression problems, and solves them iteratively. Given a dataset D = with N training samples, where and D is the feature dimensionality, an ensemble generated by GBM uses K base learners to predict the output:

LogitBoost (Supplementary Algorithm 1) [41] is a popular implementation of GBM to optimize logistic regression. In each iteration, LogitBoost first computes the pseudo-labels and weights using the Newton (for binary clas-sification) or quasi-Newton (for multi-class classification) method, and then updates the ensemble by adding a new regression model (for binary classification) or J new regression models [for J-class (J > 2) classification], which are trained to fit the pseudo-labels by weighted least-squares regression. If the original problem is regression, then in each GBM iteration, the pseudo-label is the residual between the true value and the prediction, and all samples have the same weight.

LogitBoost further converts the GBM output into a probability:

where

in which is the output vector for J-class (J > 2) classification.

Malerba et al. [42] proposed Stepwise Model Tree Induction (SMOTI) to construct model trees gradually for regression. When the tree is growing, SMOTI adds more and more variables to the leaf model to approximate the target function. At each step, it adds either a splitting node or a regression node to the tree. A splitting node partitions the sample space, whereas a regression node removes the linear effect of the variables already included in its parent node and performs linear regression on one variable.

Different from GBM, LMT [40] generates only one tree instead of multiple trees for predicting. LMT extends model trees from regression to classification, by integrating LogitBoost into the tree model. The final model at a leaf consists of all node models along the path from the root to that leaf.

Assume an LMT has trained a function for the m-th node. For an input , LMT first identifies , the leaf node it belongs to, and then all along the path from the root to that leaf node are summed up to predict the output, i.e.,

where is the collection of the node indices along the path from the root to the leaf node . Similar to LogitBoost, LMT can use (2) to output the probability.

SimpleLogistic [40] (a variant of LogitBoost) incrementally refines the linear logistic model. In each iteration, instead of using all features to perform linear regression, SimpleLogistic uses only one feature to train the model. In this way, only relevant features are selected, and the risk of over-fitting is reduced.

Gradient Boosting with Piece-wise Linear Regression Trees (GBDT-PL) [43] is a variant of XGBoost, which combines multiple Piece-wise Linear Regression Trees (PL-Trees) to predict the output. Similar to LMT, a PL-Tree fits a linear model at each node in an additive manner.

Webb [44] proposed MultiBoosting, which combines wagging (a variant of Bagging) and Adaboost [25] to reduce both the bias and the variance. It uses AdaBoost to generate a sub-committee of decision trees. By wagging a set of subcommittees, MultiBoosting may achieve lower error than AdaBoost.

RandomForest [26] uses two techniques to improve the diversity of each tree, and hence reduces over-fitting: 1) Bagging, i.e., each tree is trained with a bootstrap replica drawn from the original training set; and, 2) feature subsampling, i.e., for each node of the tree, a subset of the features is randomly selected from the complete feature set, then an optimal feature is selected from the subset to split the node.

Extra-Trees [29] splits nodes in a more random way than RandomForest. It first randomly draws one cut-point for each feature, and then selects the cut-point with the largest splitting gain to split a node. Extra-Trees reduces over-fitting by trading the base learners’ accuracy for diversity.

TAO [45] first initializes the tree with a given depth by using CART [13] or random parameters, and then circularly updates each decision or leaf node to optimize the objective function. It trains a binary classifier at each decision node to classify a sample into the left or right child node, and a constant label (TAO-c) or linear softmax classifier (TAOl) at each leaf to predict the output. Its decision nodes are oblique nodes, which can split the samples more effectively than axis-aligned nodes. Bagged TAO-l (BaggedTAO-l) [46] further uses Bagging to integrate multiple TAO-l trees for better performance.

3 BOOSTTREE AND BOOSTFOREST

This section introduces the details of our proposed BoostTree and BoostForest. All source code is available at https://github.com/zhaochangming/BoostForest.

3.1 Motivation

Three problems need to be solved in model tree based ensemble learning:

1) How to design a model tree that is more compatible with Bagging? Generally, as the model complexity

increases, the bias of the model decreases, but the variance increases. Between the two popular ensemble learning strategies, Bagging is suitable for integrating complex base learners to reduce the variance, whereas boosting for integrating simple base learners to reduce the bias. MultiBoosting combines AdaBoost and wagging to further improve the performance of a decision tree. However, how to combine GBM and Bagging to improve the performance of model trees has not been studied. LMT is a stable model with low randomness, so simply combining multiple LMTs with Bagging may not outperform a single LMT.

2) How to handle both classification and regression problems by using a single model tree structure? LMT applies LogitBoost to a decision tree to generate an accurate model tree; however, it cannot handle regression problems. GBDT-PL handles regression problems by using GBM to integrate multiple PL-Trees. However, performing boosting in a single tree is more efficient, and easier to be used in Bagging.

3) How to generate a model tree with easy parameter tuning, so that any regression algorithm, e.g., RR, ELM and SVR, can be used as its node function?

This section proposes BoostTree to solve the above three problems simultaneously. Moreover, multiple BoostTrees can be further integrated into a BoostForest for better performance.

3.2 General Idea of BoostTree

Our proposed BoostTree is inspired by LMT and GBM. Assume a BoostTree has M nodes, excluding the root. Then, it trains a function for the m-th node, , and uses (5) to predict the output.

BoostTree minimizes the following regularized loss function:

where is the regularization coefficient. The second term above penalizes the complexity of BoostTree to reduce over-fitting.

Different loss functions can be used to cope with regression and classification problems, as will be introduced in Sections 3.3-3.5. For the ease of optimization, we require to be convex and differentiable.

In general, the objective function in (6) cannot be optimized directly. Inspired by LMT and GBM, BoostTree minimizes (6) in an additive manner. Assume a tree with leaves has been generated after iterations. Then, there are nodes, excluding the root. We can rewrite (6) as:

where

i.e., is the set of all training samples belonging to Leaf m. measures the loss of Leaf m. In each iteration, BoostTree uses a greedy learning scheme to add branches to the leaf with the highest loss.

Assume Node m is the leaf node with the highest loss. After the split, is divided into two subsets: (of the left node) and (of the right node). Let and be the node models of the left and the right nodes trained separately using and , respectively. Then, the reduction of the loss in (7) is:

where

Note that C is a constant, is the ensemble of the models along the path from the root to Node m, and and are the loss functions for the left and right child nodes, respectively.

(11) can be optimized by minimizing and separately. More specifically, BoostTree uses two steps to optimize (11):

1) Split the node: BoostTree implements four criteria to select the cut-point:

a) XGBoost Splitting Criterion (XGB-SC), which uses gradient decent to reduce the loss.

b) Gini Splitting Criterion (Gini-SC), which tries to improve the purity of each leaf.

c) Mean Squared Error (MSE), which tries to reduce the MSE of each leaf.

d) C4.5 Splitting Criterion (C4.5-SC) [47], which selects the cut-point with the largest ratio of information gain.

Though all criteria are effective, XGB-SC achieved the best performance in our experiments, as will be demonstrated in Section 4.10.3; hence, we make it the default option in BoostTree.

XGB-SC uses the Euclidean norm of the leaf score to constrain the tree complexity. Ignoring the parent node loss and inspired by Equation (7) in XGBoost [12], we can calculate the splitting gain as:

are the first and second order derivatives of the loss function w.r.t. , respectively.

Similar to Extra-trees for increasing the randomness, BoostTree first draws a random cut-point uniformly for each feature, and then determines the best cut-point of the current node according to the maximum .

2) Train the node model: BoostTree uses gradient boosting to decompose the original problem into multiple sub-regression problems, and uses a node function to solve a sub-regression problem in each node. It first calculates the pseudo-labels and sample weights to generate a temporary training set, and then trains a regression model for regression or binary classification, or J regression models for J-class (J > 2) classification. For simplicity, RR is used as the default node function. More details on training the node model for different tasks are introduced in Sections 3.3-3.5.

For the ease of implementation, we can simply store all possible values of each parameter in a parameter pool, from which BoostTree randomly selects a combination of parameters, e.g., the minimum number of samples at a leaf minsamplesleaf, and the regularization coefficient . Robustness of using different parameters to train a BoostTree in BoostForest will be studied in Section 4.10.1.

Algorithm 1 shows the pseudo-code of BoostTree using RR as the node function, where the subfunction FitModel assumes different forms according to different learning tasks, as shown in Algorithms 2-4.

3.3 BoostTree for Regression

For regression problems, we use

Assume Node c is the left or right child node of Node m. According to GBM, the loss function for Node c can be expressed as:

where is the pseudo- label, i.e., the residual between the true value and the prediction, is the set of all training samples belonging to Node is a vector of the regression coefficients, and is the intercept. Note that all samples have the same weight in regression.

MSE is sensitive to noise and outliers. To improve BoostTree’s robustness, we save the lower (upper) bound lb (ub) of

Algorithm 1: BoostTree using RR as the node function.

Input: training samples, where , candidate value pool of the minimum number of samples at a leaf min samplesleaf; , candidate value pool of the regularization parameter ; (optional) MaxNumLeaf, the maximum number of leaves, default NULL. Output: A BoostTree. Randomly select and MSL from and , respectively; Initialize NumLeaf = 1, LeafList = {} and f(x) = 0; ; /* Initialize the root node. */ Add root to LeafList; ;

Use (8) to calculate the leaf loss of each node in LeafList; , where is the leaf node with the highest loss; end return node; }

each node’s pseudo-label at the training stage, and clip the corresponding node model’s output at the prediction stage:

When the tree is growing, the residuals of the nodes generated at later iterations approach zero, so lb and ub get closer. The effect of clipping will be demonstrated in Section 4.10.2.

Algorithm 2 shows the pseudo-code of BoostTree for regression.

3.4 BoostTree for Binary Classification

In classification tasks, BoostTree is built using a LogitBoostlike algorithm, which iteratively updates the ensemble of the logistic linear models F by adding a new regression model f to it.

The cross-entropy loss is used in binary classification:

where .

Assume Node c is the left or right child node of Node m. According to LogitBoost (Supplementary Algorithm 1) [41], the second order Taylor expansion can be used to approximate the loss function for Node c:

where

is a constant, and and are calculated by using (13) and (14), respectively. Note that and are irrelevant to . Therefore, can be any regression model.

Then, we can construct the pseudo-label and the sample weight , as in LogitBoost:

where

in which is the estimated probability in (2).

To improve BoostTree’s robustness to outliers, we follow LogitBoost [41] to limit the minimum weight to is the machine epsilon), and clip the value of the pseudo-label to:

where (according to LogitBoost [41]). was used in our experiments.

Finally, we can remove C in (19) to simplify the loss function for child Node c:

Algorithm 3 shows the pseudo-code of BoostTree for binary classification.

3.5 BoostTree for J-Class (J > 2) Classification

For J-class classification, we use

where is the one-hot encoding label vector, and is the output vector.

In each iteration, LogitBoost handles the J-class clas-sification problem by decomposing it into J regression problems, so becomes a set of linear models , , , , where calculates the output for Class j.

Similarly, the second order Taylor expansion can be used to approximate the loss function for Node c:

where

is a constant, is the coefficient vector of , is the j- th element of , and is the j-th diagonal element of and are again calculated by using (13) and (14), respectively.

Then, for the j-th class, we can calculate the pseudo-label and sample weight , as in LogitBoost:

where

in which is the estimated probability of Class j in (4). To prevent too large pseudo-labels, we also use (24) to clip .

Finally, we can remove C in (27) to simplify the loss function for Node c:

Algorithm 4 shows the pseudo-code of BoostTree for J-class (J > 2) classification.

3.6 BoostForest

BoostForest integrates multiple BoostTrees into a forest. It first uses bootstrap to generate K replicas of the original training set, and then trains a BoostTree on each replica.

For regression, the outputs predicted by all K BoostTrees are averaged as the final output:

For classification, the probabilities predicted by all K BoostTrees are averaged as the final probability:

Algorithm 5 gives the pseudo-code of BoostForest.

3.7 Implementation Details

Inspired by the row sampling trick in XGBoost and lightGBM, we use two tricks to improve the speed of BoostTree:

1) Randomly select BatchSize samples to identify the cut-point, if the number of samples belonging

to the splitting node is larger than the batch size BatchSize.

2) When the number of samples belonging to Node m is larger than BatchSize, randomly select BatchSize samples to train the node model, and approximate as:

BatchSize = 1, 000 was used in our experiments.

To ensure BoostTree’s number of leave nodes is larger than or equal to two, BoostTree tries to split the root node until it is split successfully.

3.8 Discussions

To clarify the novelties of BoostForest, this subsection first discusses the differences and connections between it and some related approaches, e.g., LMT, GBDT-PL, MultiBoosting and TAO, and then summarizes approaches which motivate d BoostForest.

There are four main differences between LMT and BoostForest:

1) LMT uses SimpleLogistic to train the node model, and only handles classification. BoostTree can use any regression algorithm to train the node model, and can handle both classification and regression.

2) LMT computes the splitting gains on all values of the attributes to select the cut-point, whereas BoostTree draws the cut-points completely randomly, which increases its diversity and makes BoostTrees more suitable for Bagging.

3) LMT prunes the tree to reduce over-fitting and uses only one tree to predict the output. However, a single learner can easily be affected by noise and outliers, leading to poor generalization. BoostForest combines multiple BoostTrees with Bagging to reduce over-fitting.

4) LMT’s asymptotic complexity is (ignoring the pruning operation), where Depth is the depth of the tree, N is the sample size, and D is the feature dimensionality. BoostTree’s complexity is (using RR as the node function), where the first part is the cost of building an Extra-tree, and the second is the cost of building the node models. If downsampling is performed before fitting the RR model, BoostTree’s asymptotic complexity is less than , where M is the number of nodes.

M5 and M5P train a linear model at each node during the training process. Their smoothing operations use all node models along the path from the root to a leaf to predict the output. M5 and M5P construct model trees for regression, whereas LMT constructs them for classification. LMT is the most relevant approach to BoostTree. They both integrate GBM into one single tree, train a base learner at each node to perform boosting, and combine all node models along the path from the root to the corresponding leaf to predict the output. BoostTree may be viewed as an upgraded LMT.

BoostTree improves LMT from three aspects:

1) Model structure: BoostTree expands LMT’s node models to non-linear models.

2) Application scenario: BoostTree further extends LMT to regression problems.

3) Training process: BoostTree optimizes the cut-point selection strategy by combining XGB-SC and Extratrees’ cut-point drawing strategy, and also the parameter setting strategy. These improvements make it easier to tune the hyper-parameters in BoostTree. Note that LMT can automatically perform cross-validation within each node to determine the number of training iterations for SimpleLogistic, and prune the tree to further reduce over-fitting. However, when the node models are replaced by other more complex models, the computational cost of cross-validation at all nodes may be very high. To alleviate this problem, BoostTree develops a random parameter pool sampling strategy and combines it with Bagging.

There are four main differences between GBDT-PL and BoostForest:

1) GBDT-PL trains a linear model at each node, and does not evaluate its performance on multi-class classification. BoostTree can use any regression algorithm to train the node model, and can handle binary and multi-class classifications, and regression.

2) GBDT-PL splits the node and trains the node models simultaneously, whereas BoostTree first splits the node, and then trains the node models. When the node models are replaced by other complex models, it’s very time-consuming for GBDT-PL to select the cut-point, because it needs to train the left and right node models at all candidate cut-points.

3) GBDT-PL uses GBM to train multiple PL-Trees, whereas BoostTree integrate GBM into one single tree.

4) GBDT-PL needs a validation set to select its hyper-parameters, which include the tree structure and node model parameters. Its computational cost increases with the number of hyper-parameter combinations. BoostTree optimizes its hyper-parameter setting strategy by integrating random parameter pool sampling and Bagging.

When each node uses a linear model, BoostTree has a similar structure as a tree in GBDT-PL, because all linear node models along the path to the corresponding leaf can be combined into one single linear model. However, when each node uses a non-linear model, the node models cannot be easily combined, which makes the structures of BoostTree and GBDT-PL tree different.

There are three main differences between MultiBoosting and BoostForest:

1) MultiBoosting uses traditional decision trees as the base learners, whereas BoostForest uses BoostTrees.

Each BoostTree is trained by gradient boosting, so itself can be viewed as an ensemble model.

2) MultiBoosting uses first AdaBoost to train multiple decision trees sequentially to reduce the bias, and then wagging (a variant of Bagging) to further reduce the variance. BoostForest first generates multiple BoostTrees independently, and then uses the idea of Random Forest to reduce their variance.

3) MultiBoosting can only be used for classification, whereas BoostForest can handle both classification and regression.

MultiBoosting and BoostForest both combine boosting and Bagging, but in different ways: MultiBoosting combines them serially, whereas BoostForest combines them in parallel.

TAO is a general and efficient optimization algorithm, which can train many types of decision trees, e.g., both TAO and BoostTree can train the base learners in a tree ensemble. However, there are three main differences between TAO and BoostTree:

1) TAO trains a binary classifier at each decision node to classify a sample into the left or right child node. Its decision nodes are oblique nodes, whereas BoostTree’s decision nodes are axis-aligned nodes.

2) TAO uses only the leaf model to predict the output, whereas BoostTree combines all node models along the path from the root to the corresponding leaf to predict the output.

3) TAO circularly updates each decision or leaf nodes to optimize the objective function, whereas BoostTree optimizes it by using a greedy learning scheme to add new branches to the leaf.

Table 1 summarizes the approaches that motivated BoostForest. Essentially, BoostForest integrates five existing strategies and develops two new empirical ones.

TABLE 1 Approaches that motivated BoostForest.

4 EXPERIMENTAL RESULTS

Extensive experiments were carried out to verify the effectiveness of BoostForest in both classification and regression. The following seven questions were examined:

1) What is the generalization performance of BoostForest, compared with several popular ensemble learning approaches, e.g., RandomForest [26], Extra-Trees [29], XGBoost [12], LightGBM [35], MultiBoosting [44], GBDT-PL [43], FilterBoost [33] and DeepBoosting [34]?

2) How fast does BoostForest converge, as the number of base learners increases?

3) How does the base learner model complexity affect the generalization performance of BoostForest?

4) Can BoostForest handle datasets with a large number of samples or features?

5) Can our proposed approach for constructing BoostForest, i.e., data replica by bootstrapping and random parameter selection from the parameter pool, also be used to integrate other tree models, e.g., Extra-Tree [29], M5P [19], and LMT [40]?

6) How does the performance of BoostForest change when different node functions, e.g., RR, ELM or SVR, are used in BoostTrees?

7) How robust is BoostForest to its hyper-parameters and the node splitting criterion?

4.1 Datasets

We performed experiments on 37 real-world datasets1 (20 for classification and 17 for regression), summarized in Supplementary Table 1. They covered a wide range of feature dimensionalities (between 4 and 1,024) and sample sizes (between 103 and 11,000,000). For each dataset, categorical features were converted to numerical ones by one-hot encoding. Unless stated otherwise, each feature’s distribution was scaled to a standard normal distribution, and the labels in the regression datasets were z-normalized.

4.2 Algorithms

Supplementary Table 2 shows the hyper-parameters of the 11 baselines and BoostForest used in our experiments. More details about the baseline parameters can be found online2. Unless stated otherwise, BoostForest used default parameter values.

4.3 Experimental Setting

We performed experiments on 30 small to medium sized datasets (15 for classification and 15 for regression) in Sections 4.4-4.6 and 4.10, 35 datasets (30 small to medium sized ones and five large sized ones) in Section 4.7, 32 datasets (30 small to medium sized ones and two high dimensional ones) in Section 4.8, and 17 datasets (15 small to medium sized ones and two large sized ones) in Section 4.9.

All algorithms were repeated 10 times on each dataset (except MNIST and HIGGS). For each experiment, we first randomly divided the data into 60% training, 20% validation and 20% test. For MNIST or HIGGS, we randomly divided the data into 80% training, 20% validation, and used the original test data for testing. Next, we used the validation set to select the best parameters. Then, the training set and validation set were combined to train a model with these parameters. Finally, we verified its performance on the test set. For RandomForest and Extra-Trees, we used the out-of-bag error to select their parameters.

We used the classification accuracy and the root mean squared error (RMSE) as the main performance measure for classification and regression, respectively. We also computed the average rank and training time (including the process of parameter selection) for each algorithm on each dataset. For K algorithms, the best one has rank 1, and the worst has rank K. We used seconds as the unit of time in our experiments and 50 processing cores to run the experiments on a server with two Intel(R) Xeon(R) Platinum 8276 CPUs. Additionally, we used bytes as the unit of model size in our experiments.

To validate if BoostForest significantly outperformed the baselines (), we first calculated the p-values using the standard t-test, and then performed Benjamini Hochberg False Discovery Rate (BH-FDR) correction [48] to adjust them. The statistically significant ones are marked by .

4.4 Generalization Performance of BoostForest

First, we compared the generalization performance of BoostForest with eight baselines. The results are shown in Tables 2 and 3.

Table 2 shows that BoostForest achieved the best generalization performance on 17 out of the 30 datasets, the fastest average time in regression, the second fastest average time in classification, and the best average performance, standard deviation and rank in both classification and regression.

Table 3 shows that BoostForest achieved the best generalization performance on nine out of the 11 datasets, the best average performance, standard deviation and rank, and the second fastest average time.

On average, BoostForest achieved the best classifica-tion accuracy or regression RMSE. It significantly outperformed RandomForest on 19 datasets, Extra-Trees on 15 datasets, XGBoost on 14 datasets, LightGBM on 12 datasets, MultiBoosting on nine datasets, GBDT-PL on 10 datasets, DeepBoosting on eight datasets, and FilterBoost on seven datasets.

It is also important to note that BoostForest used the default parameter settings on all datasets, and did not use a validation set to select the best parameters. So, it is very convenient to use. Additionally, each BoostTree is trained with a different training set, so BoostForest can be easily parallelized to further speed it up.

One limitation of BoostForest is that its average model size was the largest, so it may not be easily deployed to low storage devices. One of our future research directions is to reduce its model size.

4.5 Generalization Performance w.r.t. the Number of Base Learners

BoostForest needs to specify the number of BoostTrees in it. It is important to study how its performance changes with

this number.

On each dataset, we gradually increased the number of base learners from three to 250, and tuned other parameters of the four baselines by using the validation set (for XGBoost and LightGBM) or the out-of-bag error (for RandomForest and Extra-Trees).

Fig. 1 shows the accuracies of the five algorithms on the last four classification datasets. Complete results on all 15 classification datasets are shown in Supplementary Fig. 1. Generally, as the number of base learners increased, the performances of all ensemble learning approaches quickly converged. BoostForest achieved the highest classification accuracy on 11 of the 15 datasets, and the second highest on another three (MV1, BD, and PID).

Fig. 2 shows the RMSEs of the five algorithms on the last four regression datasets. Complete results on all 15 regression datasets are shown in Supplementary Fig. 2. Again, as the number of base learners increased, generally the performances of all algorithms quickly converged. BoostForest achieved the smallest RMSE on seven of the 15 datasets.

In summary, BoostForest has very good generalization performance w.r.t. the number of base learners, and it generally converges faster than the four baselines (RandomForest, Extra-Trees, XGBoost, and LightGBM).

4.6 Generalization Performance w.r.t. the Base Learner Model Complexity

We also evaluated the generalization performance of the five ensemble approaches, as the base learner model complexity increases.

The base learner model complexity was controlled by the maximum number of leaves MaxNumLeaf per tree, which was gradually increased from two to 32 for classification, and two to 256 for regression. We fixed the number of base learners at 250, and tuned other parameters of the four baselines by using the validation set (for XGBoost and LightGBM) or the out-of-bag error (for RandomForest and Extra-Trees).

Fig. 3 shows the accuracies of the five algorithms on the last four classification datasets. Complete results on all 15 classification datasets are shown in Supplementary Fig. 3. BoostForest achieved the highest classification accuracy on 12 of the 15 datasets.

Fig. 4 shows the average RMSEs of the five algorithms on the last four regression datasets. Complete results on all 15 regression datasets are shown in Supplementary Fig. 4. On most datasets, the performances of all algorithms increased as the maximum number of leaves per tree increased. BoostForest achieved the smallest RMSE on seven of the 15 datasets, and the second smallest RMSE on the NO dataset.

These results suggest that BoostForest generalize well with the base learner model complexity.

4.7 Generalization Performance on Large Datasets

Previous experiments have shown the superiority of BoostForest on small to medium sized datasets with not very high dimensionalities. This subsection investigates its performance on four datasets with a large number of samples and/or features.

Fig. 2. Generalization performance w.r.t. the number of base learners, averaged over 10 repeats. (a) Average classification accuracies on the last four classification datasets, with different number of base learners. Complete results on the 15 classification datasets are shown in Supplementary Fig. 1. (b) Average RMSEs on the last four regression datasets, with different number of base learners. Complete results on the 15 regression datasets are shown in Supplementary Fig. 2.

Fig. 3. Generalization performance w.r.t. the base learner model complexity, averaged over 10 repeats. (a) Average classification accuracies on the last four classification datasets, with different maximum number of leaves. Complete results on the 15 classification datasets are shown in Supplementary Fig. 3. (b) Average RMSEs on the last four regression datasets, with different maximum number of leaves. Complete results on the 15 regression datasets are shown in Supplementary Fig. 4.

TABLE 2 Performances of the seven ensemble learning approaches on the 30 datasets. The best performance is marked in bold. indicates a statistically significant win for BoostForest.

TABLE 3 Performances of the four ensemble learning approaches on the 11 binary classification datasets. The best performance is marked in bold. • indicates a statistically significant win for BoostForest.

Table 4 compares the performance of BoostForest with four classical and popular ensemble methods on three classification datasets and two regression datasets. For the SUSY dataset, we set minsamplesleaf = 2 and MaxNumLeaf = 20, 000 in each BoostTree.

BoostForest still demonstrated superior and consistent performance: it achieved the highest average accuracy and the second smallest average time in classification, and the lowest average RMSE and the third smallest average time in regression.

Considering that BoostForest’s average model size was the largest and its base learners were more complex than baselines’, we also compared BoostForest with BaggingLightGBM, which uses Bagging to integrate 600 LightGBMs, and BaggedTAO-l, which also uses Bagging to integrate multiple complex trees. To increase BaggedTAO-l’s training speed on YB and LR, we selected the cut-point from the set of each feature’s {10, 20, . . . , 90} quantiles to initialize each

TABLE 4 Performances of the five ensemble learning approaches on datasets with a large number of samples and/or features. The best performance is marked in bold. indicates statistically significant win for BoostForest.

TAO-l in BaggedTAO-l. LMT and M5P are also complex trees, which are compared with BoostForest in Section 4.8.

Supplementary Table 3 compares BoostForest with Bagging-LightGBM. BoostForest significantly outperformed Bagging-LightGBM on 15 datasets, and achieved higher accuracies on 14 of the 18 classification datasets, and lower RMSEs on 11 of the 17 regression datasets. These results indicated that BoostForest can still achieve promising performance when its average model size was smaller than Bagging-LightGBM.

Supplementary Table 4 compares BoostForest with BaggedTAO-l. BoostForest significantly outperformed BaggedTAO-l on four datasets, and achieved higher accuraies on 16 of the 17 classification datasets. BaggedTAOl had smaller average model size, because TAO can use the sparsity penalty to remove unnecessary parameters. This idea may also be used to reduce BoostForest’s model size.

4.8 Use Other Base Learners in BoostForest

Next, we studied if the strategy that BoostForest uses to combine multiple BoostTrees (data replica by bootstrapping, and random parameter selection from a parameter pool) can also be extended to other tree models, i.e., whether we can still achieve good ensemble learning performance when BoostTree is replaced by another tree model.

When Extra-tree, LMT and M5P were used as the tree model, the resulting forests were denoted as ETForest, LMForest and ModelForest, respectively. Individual M5P in ModelForest, LMT in LMForest, and Extratree in ETForest, was not pruned, and minsamplesleaf was randomly sampled from {5, 6, . . . , 15}. The parameters tuned for the baseline Extra-tree were maxdepth and minsamplesleaf, whose candidate values were {1, 2, 4, 6, 8, 10} and {5, 10, 15}, respectively. We set minsamplesleaf = 10 and in BoostTree.

Supplementary Table 5 shows the results. ETForest outperformed Extra-tree on 30 of the 32 datasets. ModelForest outperformed M5P on all 16 regression datasets. BoostForest outperformed BoostTree on all 32 datasets. These results indicated that our proposed strategy for integrating BoostTrees into BoostForest can also be used to integrate Extra-tree and M5P into a composite learner with improved performance.

However, LMForest only outperformed LMT on six of the 16 classification datasets, though it achieved better average performance than LMT, i.e., our proposed strategy is not very effective in combining multiple LMTs to further improve their performance. Additionally, using Python to call LMT’s and M5P’s Java APIs to generate the base learners in parallel is not very efficient, so the average time of LMForest and ModelForest was relatively long.

Finally, BoostForest outperformed LMForest on 14 datasets, among which five were statistically significant; BoostForest outperformed ETForest on 31 datasets, among which 24 were statistically significant; and, BoostForest outperformed ModelForest on 14 datasets, among which 11 were statistically significant.

In summary, BoostForest achieved better average classi-fication performance than LMForest and ETForest, and also better average regression performance than ModelForest and ETForest, indicating that BoostTree is a more effective base learner for our proposed ensemble strategy than ExtraTree, LMT and M5P.

4.9 Use Other Regression Models in BoostTree

We also studied if other more complex and nonlinear regression algorithms, e.g., ELM and SVR, can be used to replace RR as the node function in BoostTree. The resulting trees are denoted as BoostTree-ELM and BoostTree-SVR, respectively, and the corresponding forests as BoostForestELM and BoostForest-SVR.

ELM [20] is a single hidden layer neural network. It randomly generates the hidden nodes, and analytically determines the output weights through generalized inverse or RR. Its model complexity can be controlled by the number of hidden nodes NumHiddenNodes and the regularization coefficient of RR. We set their candidate values to {10, 20, 30, 40} and {0.0001, 0.001, 0.01, 0.1, 1}, respectively, to construct the parameter pool. The sigmoid activation function was used in the hidden layer.

Linear SVR [21] was used in BoostTree-SVR. The parameter pool for the regularization parameter C and the slack variable was {0.01, 0.1, 1.0, 10, 100} and {0.1, 0.2, 0.4, 0.8, 1.0}, respectively. We set minsamplesleaf = 10 in BoostTree-ELM and BoostTree-SVR, and BoostForestELM and BoostForest-SVR to {5, 6, . . . , 15}, respectively. Details of BoostTree-ELM, BoostForest-ELM, BoostTree-SVR and BoostForest-SVR are described in Supplementary Algorithms 2-7.

Supplementary Table 6 compares the performances of ELM and BoostTree-ELM with BoostForest-ELM (SVR and BoostTree-SVR with BoostForest-SVR) on the 15 regression datasets. BoostForest-ELM statistically significantly outperformed ELM (BoostTree-ELM) on 13 (15) datasets. BoostForest-SVR statistically significantly outperformed SVR (BoostTree-SVR) on 14 (15) datasets. BoostTree-ELM and BoostTree-SVR were more likely to overfit, because of their high model complexity. So, it is necessary to combine multiple BoostTrees into BoostForest to reduce over-fitting.

These results showed that when more complex and nonlinear regression algorithms, such as ELM and SVR, are used to replace RR as the node function in BoostTree, the performance may degrade due to overfit; however, integrating the corresponding BoostTrees into a BoostForest can always improve the performance.

We also explored if convolutional neural network (CNN) and deep neural network (DNN) can be used as the node function in BoostForest. The resulting forests are denoted as BoostForest-CNN and BoostForest-DNN, respectively. Supplementary Fig. 5 shows the structures of CNN and DNN used in our experiments. We initialized each node model of BoostForest-CNN or BoostForest-DNN by using its parent node model’s weights, scaled its output values to , to be consistent with the range of pseudo-label in (24), and trained it with 20 epochs. For the MNIST dataset, we set minsamplesleaf = 1, 024, batch size to 1,024, and tree depth to 5 in each BoostTree, and scaled the minimum and maximum values of each feature to 0 and 1, respectively. For the HIGGS dataset, we set minsamplesleaf = 2, 048, batch size to 2,048, and tree depth to 5 in each BoostTree, and randomly selected 2,000,000 samples to train each base learner in BoostForest-DNN and Bagging-DNN, respectively. If the number of samples belonging to the splitting node was larger than 100,000, we randomly selected 100,000 samples to identify the cut-point. When training CNN or DNN, we adopted AdamW3 with betas of (0.9, 0.999), initial learning rate 0.01, and weight decay 0.0001.

Table 5 compares the performances of BoostForest-CNN and BoostForest-DNN with six approaches on MNIST and HIGGS, respectively. With less training time, BoostForestCNN and BoostForest-DNN achieved higher accuracies than Bagging-CNNand Bagging-DNN, respectively. With smaller model size, BoostForest-CNN (BoostForestDNN) achieved higher accuracies than RandomForest, Extra-Trees, Bagging-XGBoost, Bagging-lightGBM, and Bagging-CNN(Bagging-DNN). These results indicated that complex CNN and DNN can also be used as the node function in BoostForest.

TABLE 5 Performances of the seven ensemble learning approaches on MNIST and HIGGS. The highest classification accuracy is marked in bold. [means the training time was recorded by running the approach on a NVIDIA RTX 3090 GPU. CNNwere trained with 150 epochs, and CNNwere trained with three epochs.

4.10 Robustness of BoostForest

To investigate the robustness of BoostForest, we performed extensive experiments to study how its performance changed with different hyper-parameters. Additionally, we also studied the effect of the clipping operation in (17) and the splitting criterion.

4.10.1 Effect of min

There are mainly two hyper-parameters in our proposed BoostForest: minsamplesleaf and . Supplementary Table 7 shows the average performances of BoostForest, as minsamplesleaf increased from 5 to 15, and increased from 0.0001 to 1. Randomly selecting each BoostTree’s parameters from the parameter pool to form a BoostForest achieved comparable average performance with using the best parameters in both regression and classification, indicating that we can use random parameter pool sampling to simplify BoostForest’s parameter selection process.

4.10.2 Effect of the Clipping Operation in Regression

Clipping makes BoostForest more robust to noise and outliers in regression. To verify the effect of the clipping operation, we compared the performance of BoostForest (w/o clipping)with BoostForest (w/ clipping).

Supplementary Table 8 shows the results. BoostForest (w/ clipping) outperformed BoostForest (w/o clipping) on most datasets. When the data noise was small, which means BoostForest can achieve a small RMSE, clipping made BoostForest too conservative to achieve better performance, e.g., on the AQ dataset.

Compared with the baselines in Table 2, even BoostForest (w/o clipping) achieved better average RMSE than RandomForest, Extra-Trees, XGBoost, LightGBM and GBDTPL, indicating that BoostForest (w/o clipping) is also an effective model, though BoostForest (w/ clipping) is more effective.

4.10.3 Effect of the Splitting Criterion

We also studied if Gini-SC, C4.5-SC (usually used in C4.5 and LMT) or MSE-SC (usually used in CART and ExtraTrees), can replace XGB-SC in BoostForest.

Supplementary Table 9 shows BoostForest’s performances of using different splitting criteria. Using Gini-SC, C4.5-SC and XGB-SC achieved comparable average accuracies in classification. Using XGB-SC achieved better average RMSE than using MSE-SC in regression.

1) Improve BoostTree to handle NULL values. 2) Reduce BoostForest’s model size by using the sparsity penalty in TAO [45].

REFERENCES

[1] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: CRC Press, 2012.

[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York City, NY: Springer, 2009.

[3] S. Wang, K. Fan, N. Luo, Y. Cao, F. Wu, C. Zhang, K. A. Heller, and L. You, “Massive computational acceleration by using neural networks to emulate mechanism-based biological models,” Nature Communications, vol. 10, no. 1, pp. 1–9, 2019.

[4] X. Sun, Y. Liu, and L. An, “Ensemble dimensionality reduction and feature gene extraction for single-cell RNA-seq data,” Nature Communications, vol. 11, no. 1, pp. 1–9, 2020.

[5] Y. Cao, T. A. Geddes, J. Y. H. Yang, and P. Yang, “Ensemble deep learning in bioinformatics,” Nature Machine Intelligence, vol. 2, no. 9, pp. 500–508, 2020.

[6] S. Paisitkriangkrai, C. Shen, and A. v. d. Hengel, “Pedestrian detection with spatially pooled features and structured ensemble learning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1243–1257, 2016.

[7] C. Yang, H. Zhao, L. Bruzzone, J. A. Benediktsson, Y. Liang, B. Liu, X. Zeng, R. Guan, C. Li, and Z. Ouyang, “Lunar impact crater identification and age estimation with Chang’E data by deep and transfer learning,” Nature Communications, vol. 11, no. 1, pp. 1–15, 2020.

[8] Y. F. Li, L. Z. Guo, and Z. H. Zhou, “Towards safe weakly supervised learning,” IEEE Trans.on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 334–346, 2021.

[9] S. S. Kim, K. K. Dey, O. Weissbrod, C. Marquez-Luna, S. Gazal, and A. L. Price, “Improving the informativeness of Mendelian disease pathogenicity scores for common disease,” Nature Communications, vol. 11, pp. 1–15, 2020.

[10] R. Agius, C. Brieghel, M. A. Andersen, A. T. Pearson, B. Lederg- erber, A. Cozzi-Lepri, Y. Louzoun, C. L. Andersen, J. Bergstedt, J. H. von Stemann et al., “Machine learning can identify newly diagnosed patients with CLL at high risk of infection,” Nature Communications, vol. 11, no. 1, pp. 1–17, 2020.

[11] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189– 1232, 2001.

[12] T. Q. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, San Francisico, CA, Aug. 2016, pp. 785– 794.

[13] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.

[14] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[15] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classi- fier ensembles and their relationship with the ensemble accuracy,” Machine Learning, vol. 51, no. 2, pp. 181–207, 2003.

[16] E. K. Tang, P. N. Suganthan, and X. Yao, “An analysis of diversity measures,” Machine Learning, vol. 65, no. 1, pp. 247–271, 2006.

[17] G. Brown, J. Wyatt, R. Harris, and X. Yao, “Diversity creation methods: A survey and categorisation,” Information Fusion, vol. 6, no. 1, pp. 5–20, 2005.

[18] J. R. Quinlan, “Learning with continuous classes,” in Proc. 5th Australian Joint Conf. on Artificial Intelligence, Tasmania, Australia, Nov. 1992, pp. 343–348.

[19] Y. Wang and I. H. Witten, “Induction of model trees for predicting continuous classes,” in Proc. 9th European Conf. on Machine Learning, Prague, Czech Republic, 1997.

[20] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning ma- chine: Theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.

[21] H. Drucker, C. J. Burges, L. Kaufman, A. J. Smola, and V. Vapnik, “Support vector regression machines,” in Proc. Advances in Neural Information Processing Systems, Cambridge, MA, Dec. 1997, pp. 155–161.

[22] A. Jain, J. Mao, and K. Mohiuddin, “Artificial neural networks: A tutorial,” Computer, vol. 29, no. 3, pp. 31–44, 1996.

[23] L. Hyafil and R. L. Rivest, “Constructing optimal binary decision trees is NP-complete,” Information Processing Letters, vol. 5, no. 1, pp. 15–17, 1976.

[24] Z. H. Zhou and J. Feng, “Deep forest,” National Science Review, vol. 6, no. 1, pp. 74–86, 2018.

[25] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proc. 13th Int’l Conf. on Machine Learning, Bari, Italy, Jul. 1996, pp. 148–156.

[26] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[27] I. Barandiaran, “The random subspace method for constructing decision forests,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 1–22, 1998.

[28] Y. Amit and D. Geman, “Shape quantization and recognition with randomized trees,” Neural Computation, vol. 9, no. 7, pp. 1545– 1588, 1997.

[29] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.

[30] T. Hastie, S. Rosset, J. Zhu, and H. Zou, “Multi-class AdaBoost,” Statistics and Its Interface, vol. 2, no. 3, pp. 349–360, 2009.

[31] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics and Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.

[32] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean, “Boosting algo- rithms as gradient descent,” in Proc. Advances in Neural Information Processing Systems, Breckenridge, Colorado, Dec. 2000, pp. 512– 518.

[33] J. K. Bradley and R. E. Schapire, “Filterboost: Regression and classification on large datasets,” in Proc. 20th Int’l Conf. on Neural Information Processing Systems, Vancouver, Canada, Dec. 2007, pp. 185–192.

[34] C. Cortes, M. Mohri, and U. Syed, “Deep boosting,” in Proc. 31st Int’l Conf. on Machine Learning, Beijing, China, Jun. 2014, pp. 1179– 1187.

[35] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Y. Liu, “LightGBM: A highly efficient gradient boosting decision tree,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 3146–3154.

[36] T. Q. Chen and T. He, “Higgs boson discovery with boosted trees,” in Proc. Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2015, pp. 69–80.

[37] A. Rakhlin, A. Shvets, V. Iglovikov, and A. A. Kalinin, “Deep convolutional neural networks for breast cancer histology image analysis,” in Proc. 15th Int’l Conf. on Image Analysis and Recognition, Povoa de Varzim, Portugal, Jun. 2018, pp. 737–744.

[38] L. Liu, M. Ji, and M. Buchroithner, “Combining partial least squares and the gradient-boosting method for soil property retrieval using visible near-infrared shortwave infrared spectra,” Remote Sensing, vol. 9, no. 12, pp. 1299–1317, 2017.

[39] J. Walsh, I. T. Heazlewood, and M. Climstein, “Regularized linear and gradient boosted ensemble methods to predict athletes’ gender based on a survey of masters athletes,” Model Assisted Statistics and Applications, vol. 14, no. 1, pp. 47–64, 2019.

[40] N. Landwehr, M. Hall, and E. Frank, “Logistic model trees,” Machine Learning, vol. 59, no. 1-2, pp. 161–205, 2005.

[41] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regres- sion: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.

[42] D. Malerba, A. Appice, M. Ceci, and M. Monopoli, “Trading-off local versus global effects of regression nodes in model trees,” in Proc. 13th Int’l Symposium on Foundations of Intelligent Systems, Lyon, France, Jun. 2002, pp. 393–402.

[43] Y. Shi, J. Li, and Z. Li, “Gradient boosting with piece-wise linear regression trees,” in Proc. 28th Int’l Joint Conf. on Artificial Intelligence, Macao, China, Aug. 2019, pp. 3432–3438.

[44] G. I. Webb, “Multiboosting: A technique for combining boosting and wagging,” Machine Learning, vol. 40, no. 2, pp. 159–196, 2000.

[45] M. A. Carreira-Perpinan and P. Tavallali, “Alternating optimiza- tion of decision trees, with application to learning sparse oblique trees,” in Proc. Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2018, pp. 1211–1221.

[46] M. A. Carreira-Perpinan and A. Zharmagambetov, “Ensembles of bagged TAO trees consistently improve over random forests, AdaBoost and gradient boosting,” in Proc. of the 2020 ACM-IMS on Foundations of Data Science Conference, Virtual, Oct. 2020, pp. 35––46.

[47] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1994.

[48] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society, vol. 57, no. 1, pp. 289–300, 1995.

APPENDIX A SUPPLEMENTARY ALGORITHMS

This section presents pseudo-code of additional algorithms introduced in this paper.

Input: training samples, where ; , candidate value pool of the minimum number of samples at a leaf; , candidate value pool of the number of hidden nodes; , candidate value pool of the regularization parameter ; (optional) MaxNumLeaf, the maximum number of leaves, default NULL.

Output: A BoostTree-ELM.

Let and denote the maximal and minimal value of , respectively; Draw a random cut-point s uniformly in ; ; ; if and then

Calculate in (13); if splitthen , split; end end end if splitthen ); ); node; node; NumLeaf = NumLeaf + 1; Add node.leftChild and node.rightChild to LeafList; end Pop node from LeafList; if |LeafList| > 0 and (MaxNumLeaf == NULL or NumLeaf < MaxNumLeaf) then

Use (9) to calculate the leaf loss of each node in LeafList; Identify , the leaf node with the highest loss; ; end return node; }

Input: training samples, where ; , candidate value pool of the minimum number of samples at a leaf; , candidate value pool of the regularization parameter; , candidate value pool of the slack variable; (optional) MaxNumLeaf, the maximum number of leaves, default NULL.

Output: A BoostTree-SVR

Let and denote the maximal and minimal value of , respectively; Draw a random cut-point s uniformly in ; ; ; if and then

Calculate in (13); if splitthen , split; end end end if splitthen ); ); node; node; NumLeaf = NumLeaf + 1; Add node.leftChild and node.rightChild to LeafList; end Pop node from LeafList; if |LeafList| > 0 and (MaxNumLeaf == NULL or NumLeaf < MaxNumLeaf) then

Use (9) to calculate the leaf loss of each node in LeafList; Identify , the leaf node with the highest loss; ; end return node; }

Input: , sample set of the current node; , ensemble of the models along the path from the root node to the parent node of the current node; , the regularization parameter; M, the number of hidden nodes.

Output: ELM model for the current node.

Supplementary Algorithm 5: BoostForest-ELM for regression.

Input: training samples; n

Supplementary Algorithm 7: BoostForest-SVR for regression.

Input: training samples; n

APPENDIX B SUPPLEMENTARY FIGURES

APPENDIX C SUPPLEMENTARY TABLES

Supplementary Table 2 Parameter settings for RandomForest, Extra-Trees, XGBoost, LightGBM, GBDT-PL, MultiBoosting, DeepBoosting, FilterBoost, LMT, M5P, BaggedTAO-l and BoostForest. N is the number of samples.

Supplementary Table 3 Performances on the 35 datasets, when BoostForest is compared with Bagging-LightGBM. The best performance is marked in bold. statistically significant win for BoostForest.

Supplementary Table 4 Mean and standard deviation (in parentheses) of the classification accuracy, when BoostForest is compared with BaggedTAO-l. The best performance is marked in bold. indicates statistically significant win for BoostForest.

Supplementary Table 5 Performances of the eight approaches on the 32 datasets. The best performance is marked in bold. indicates statistically significant win for BoostForest.

Supplementary Table 6 Mean and standard deviation (in parentheses) of the regression RMSE, when ELM or SVR is used to replace RR in BoostTree. The best performance is marked in bold. indicates statistically significant win for BoostForest-ELM or BoostForest-SVR.

Supplementary Table 8 Mean and standard deviation (in parentheses) of the regression RMSE, when BoostForest (w/ clipping) is compared with BoostForest (w/o clipping). The best performance is marked in bold. indicates statistically significant win for BoostForest (w/ clipping).

Supplementary Table 9 Performances on the 30 datasets, when BoostForest (XGB-SC) is compared with BoostForest (Gini-SC) and BoostForest (MSE-SC). The best performance is marked in bold. indicates statistically significant win for BoostForest (XGB-SC).

designed for accessibility and to further open science