Random forest (RF) (Breiman, 2001; Liaw and Wiener, 2002) has been widely used in many fields, including bioinformatics (Riddick et al., 2011; Yuan et al., 2012). RF can handle mixed categorical and numerical features and multiple classes, is insensitive to the scale of features, and is considered a powerful supervised learner.
RF can provide importance scores that indicate the contribution of each feature. However, high-dimensional problems can have a huge number of features (all the gene data sets considered in our experiments have more than 1000 features), and it is challenging to investigate the importance scores of thousands of features. Therefore, it is desirable to develop a feature selection algorithm for RF.
The guided regularized random forest (GRRF) proposed by Deng and Runger (2013) uses the importance scores from an RF built on the complete training data to complement the information gain in a local node. However, the trees in GRRF can be highly correlated, and GRRF cannot be built in parallel (Deng and Runger, 2013).
The guided random forest (GRF), proposed in this work, addresses the issues mentioned above. GRF is guided by the importance scores from an RF, and each tree in GRF is built independently of the other trees. Experiments on 10 gene data sets show that GRF uses far fewer features than RF, and that RF applied to the features selected by GRF is more accurate than RF applied to all features.
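Let $\mathrm{gain}(X_i)$ denote the Gini information gain of using a feature $X_i$ to split a tree node. The key idea of GRF is weighting $\mathrm{gain}(X_i)$ using the importance scores from an RF:

$$\mathrm{gain}_G(X_i) = \lambda_i \, \mathrm{gain}(X_i) \tag{1}$$

$$\lambda_i = 1 - \gamma + \gamma \, \frac{\mathrm{Imp}_i}{\mathrm{Imp}^*} \tag{2}$$

where $\mathrm{Imp}_i$ is the importance score of $X_i$ from an RF, $\mathrm{Imp}^*$ is the maximum importance score, $\mathrm{Imp}_i / \mathrm{Imp}^*$ is the normalized importance score, and $\gamma \in [0, 1]$ controls the weight of the importance scores from RF. It can be seen that features with smaller importance scores are penalized more in GRF, and the penalty increases as $\gamma$ increases (GRF becomes RF when $\gamma = 0$). In this work I use the maximum penalty (i.e., $\gamma = 1$), in order to use a small number of features in GRF. So

$$\lambda_i = \frac{\mathrm{Imp}_i}{\mathrm{Imp}^*} \tag{3}$$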
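The weighting can be illustrated with a short Python sketch. It assumes the weight has the form $\lambda_i = 1 - \gamma + \gamma \, \mathrm{Imp}_i / \mathrm{Imp}^*$ with $\gamma \in [0, 1]$, which agrees with the description above ($\gamma = 0$ recovers RF, $\gamma = 1$ gives the maximum penalty). The function names and the numbers are hypothetical and are not taken from any package.

```python
def guided_weights(importances, gamma=1.0):
    """Return lambda_i = 1 - gamma + gamma * Imp_i / Imp* for each feature.

    Assumed form of the weight; gamma = 1 reduces it to Imp_i / Imp*.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    imp_max = max(importances)
    if imp_max <= 0:
        raise ValueError("at least one importance score must be positive")
    return [1.0 - gamma + gamma * imp / imp_max for imp in importances]


def guided_gains(gains, importances, gamma=1.0):
    """Multiply each feature's Gini gain at a node by its weight lambda_i."""
    weights = guided_weights(importances, gamma)
    return [w * g for w, g in zip(weights, gains)]


def best_feature(scores):
    """Index of the feature with the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical importance scores from an ordinary RF and
# hypothetical Gini gains of three features at one tree node.
rf_importance = [8.0, 2.0, 4.0]
node_gain = [0.10, 0.30, 0.12]

weights = guided_weights(rf_importance, gamma=1.0)
chosen_guided = best_feature(guided_gains(node_gain, rf_importance, gamma=1.0))
chosen_plain = best_feature(guided_gains(node_gain, rf_importance, gamma=0.0))
print(weights, chosen_guided, chosen_plain)
```

In this toy node, the feature with the largest raw gain has a low importance score, so with $\gamma = 1$ its weighted gain falls below that of the most important feature and the split changes. With $\gamma = 0$ all weights equal 1 and the ordinary RF choice is recovered.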
Note that the key difference between GRF and GRRF is that, in GRRF, the features used in previous trees affect the current tree, whereas in GRF they do not. The features used in a GRRF model are expected to be relevant and non-redundant, while the features used in a GRF model are expected to be relevant, but not necessarily non-redundant.
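The contrast can be sketched in Python as follows. The GRRF-style rule here is an assumption based on the regularized gain described by Deng and Runger (2013): a feature is penalized only if it has not been used in an earlier split, and that set of used features is shared across trees. All names and numbers are hypothetical.

```python
def grf_scores(gains, weights):
    """GRF: every feature's gain is weighted, regardless of earlier splits."""
    return [w * g for w, g in zip(weights, gains)]


def grrf_scores(gains, weights, used_features):
    """GRRF-style rule (assumed): penalize only features not used so far."""
    return [g if i in used_features else w * g
            for i, (w, g) in enumerate(zip(weights, gains))]


def best_feature(scores):
    """Index of the feature with the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical Gini gains and weights for three features at one node.
gains = [0.20, 0.30, 0.05]
weights = [1.0, 0.6, 0.2]

# GRF does not look at any shared state, so the choice is always the same.
grf_choice = best_feature(grf_scores(gains, weights))

# The GRRF-style choice depends on which features earlier trees have used.
grrf_choice_fresh = best_feature(grrf_scores(gains, weights, used_features=set()))
grrf_choice_after = best_feature(grrf_scores(gains, weights, used_features={1}))
print(grf_choice, grrf_choice_fresh, grrf_choice_after)
```

Because `grf_scores` takes no set of previously used features, each tree can be grown without waiting for the others, which is what allows GRF trees to be built independently.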
Code 1 shows an example of using GRF for feature selection. In the code, a classification data set with 500 features is simulated, of which only 2 are relevant to the class. RF trained on all the features misclassifies 54 out of 250 instances, whereas RF trained on the 196 features selected by GRF misclassifies 34.
from the results of Deng and Runger (2013) due to randomness. Table 1 shows the average error rates of different methods. GRF-RF outperforms RF on 9 data sets, 7 of which show significant differences at the 0.05 level. The advantage of GRF-RF over GRRF and GRRF-RF is also clear. GRF-RF also outperforms GRF, so applying RF to the features selected by GRF is better than using GRF itself as a classifier.
Table 2. The number of instances, classes and features of the data sets, and the number of features used in GRF and RF.
Table 2 summarizes the data sets and shows the number of features used in different models. RF uses a subset of the features in the model, and GRF uses an even smaller number of features.
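As a minimal sketch of what "features used in a model" can mean, the following Python code takes the union of the split features over all trees and then keeps only those columns, as one would before applying RF to the features selected by GRF. The forest representation (one list of split feature indices per tree) and the helper names are hypothetical simplifications, not the data structure of any package.

```python
def features_used(forest):
    """Return the sorted indices of all features used in any split.

    `forest` is a hypothetical representation: one list of split
    feature indices per tree.
    """
    used = set()
    for tree_splits in forest:
        used.update(tree_splits)
    return sorted(used)


def select_columns(rows, feature_indices):
    """Keep only the selected feature columns of a row-major data matrix."""
    return [[row[j] for j in feature_indices] for row in rows]


# Three toy trees over six features; features 1, 2 and 4 are never used.
forest = [[3, 0, 3], [0, 5], [3]]
selected = features_used(forest)

# Toy data matrix with two instances and six features.
X = [[10, 11, 12, 13, 14, 15],
     [20, 21, 22, 23, 24, 25]]
X_reduced = select_columns(X, selected)
print(selected, X_reduced)
```

A second model trained on `X_reduced` then sees only the selected features.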
The guided random forest (GRF) is proposed here for feature selection, in particular for gene classification. Experiments show that GRF-RF not only significantly outperforms RF in accuracy, but also uses far fewer features in the model. In this work I discuss the advantages of GRF for high-dimensional gene data sets. It may also be valuable to find other cases where GRF has advantages over other methods, with the option of tuning the parameter $\gamma$ in Equation (2) (fixed as 1 here). Furthermore, in this work, $\lambda_i$ is determined by the importance score of feature $X_i$ from an ordinary random forest. However, $\lambda_i$ can be specified in other ways too, e.g., by F-score or human knowledge.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Deng, H. and Runger, G. (2013). Gene selection with guided regularized random forest. Pattern Recognition, to appear.
Díaz-Uriarte, R. and De Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1), 3.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Riddick, G., Song, H., Ahn, S., Walling, J., Borges-Rivera, D., Zhang, W., and Fine, H. A. (2011). Predicting in vitro drug sensitivity using random forests. Bioinformatics, 27(2), 220–224.
Yuan, Y., Xu, Y., Xu, J., Ball, R. L., and Liang, H. (2012). Predicting the lethal phenotype of the knockout mouse by integrating comprehensive genomic data. Bioinformatics, 28(9), 1246–1252.