Visualizing Random Forest with Self-Organising Map

2014·Arxiv

Abstract

Abstract

Random Forest (RF) is a powerful ensemble method for clas-sification and regression tasks. It consists of decision trees set. Although, a single tree is well interpretable for human, the ensemble of trees is a black-box model. The popular technique to look inside the RF model is to visualize a RF proximity matrix obtained on data samples with Multidimensional Scaling (MDS) method. Herein, we present a novel method based on Self-Organising Maps (SOM) for revealing intrinsic relationships in data that lay inside the RF used for classification tasks. We propose an algorithm to learn the SOM with the proximity matrix obtained from the RF. The visualization of RF proximity matrix with MDS and SOM is compared. What is more, the SOM learned with the RF proximity matrix has better classification accuracy in comparison to SOM learned with Euclidean distance. Presented approach enables better understanding of the RF and additionally improves accuracy of the SOM.

Keywords: Random Forest, Self-Organising Maps, visualization, clas-sification, proximity matrix

1 Introduction

Nowadays, there is a need for efficient data mining techniques. The human readability of the model is an important factor of a good data mining algorithm. Among various data mining methods very popular are decision trees [20], [3]. Although, they have an easy interpretable model, a single tree does not always obtain the highest accuracy. To overcome this problem, various ensemble methods were proposed. Among them, the popular is Random Forest (RF) proposed by Leo Breiman [2]. The RF builds a set of trees using bagging and random subspace methods. The final output is a mode of responses from all individual trees. The RF can be used for classification and regression tasks. Despite the high accuracy of the RF, the human readability of the model is lost. There exist some methods to look inside RF black-box, like: examining variable importance [4], parallel cooridinate plots by variable [2] or visualizing the RF proximity distance matrix with Multidimensional Scaling [6]. Herein, we propose a novel method for visualizing the RF proximity matrix based on Self-Organising Maps (SOM) [8]. The SOM is an artificial neural network model that maps high-dimensional input data space onto usually two-dimensional lattice of neurons in an unsupervised way. Although, the SOM is an originally unsupervised algorithm there exist supervised extensions [9], [14], [13], [7]. The SOM has been proved as an efficient data mining tool in many real life applications [11], [12], [18], [19]. In this paper, we focus on using the labeled SOM model for mapping the RF used in classification tasks. The RF proximity matrix will be used for the SOM learning. It was shown that using more sophisticated distance metric than Euclidean can improve the accuracy of the SOM [15]. The RF proximity matrix was used earlier for improving clustering accuracy. Horvath et al. [16] presented method for building clusters from the RF learned with unlabeled data and successfully used it for tumor detection [17]. Moosmann et al. [10] used the RF for efficient segmentation of images, where leaves were assigned to distinct image regions rather than to specific class. Gray et al. [5] used the RF proximity matrix and the MDS for classification of medical images of different types of dementia. The paper is organized as follow: firstly, we describe the SOM and the RF algorithms; secondly, proposed approach for learning the SOM with the RF proximity matrix is presented; then, the MDS vs the SOM visualization and the accuracy of the SOM learned with Euclidean metric and the RF proximity matrix are compared.

2 Methods

Let’s denote data set as , where is an attribute vector, , M is attribute vector length and is a discrete class number of i-th sample, i = [1, 2, ..., N] and c = [1, 2, ..., C].

2.1 Self Organising Maps

In this paper, we used the SOM as a two-dimensional grid of neurons. Each neuron is represented by a weights vector , where (p, q) are indices of the neuron in the grid. It is important to notice, that neuron’s weights vector has the same length as sample’s attribute vector in the data set. The neuron’s weights directly corresponds to attributes in the data set. In the training phase, for each sample we search for a neuron which is the closest to the i-th sample. In the original SOM algorithm the distance is computed with squared Euclidean distance by following equation:

The neuron (p, q) with the smallest distance to i-th sample is so-called the Best Matching Unit (BMU), and we note its indicies as (r, v). Once the BMU is found, the weights update step is executed. The weights of each neuron are updated with formula:

where t is an iteration number and is a learning coefficient and is a neigh- bourhood function. We assume that one iteration is a presentation of one training sample, whereas a presentation of all training samples is one learning epoch. The learning coefficient is decreased between consecutive epochs to improve network’s ability to remember patterns. It is described by:

where is the initial step size, e is the current epoch number and is responsible for regulating the speed of the decrease. The neighbourhood function controls changing of weights with respect to the distance to the BMU in the grid. It is noted as:

where describes the neighbourhood function width. This parameter is increasing during learning ) - it assures that neighbourhood becomes narrower during the training. The network is trained till chosen number of learning procedure epochs is exceeded.

In the described algorithm the class label information is not used. The simplest approach of using the SOM as a classifier is to label the neurons after the unsupervised training. For each neuron we remember the overal sum of neighbourhood values from each class over all samples. The label of major class is assigned to the neuron. In the testing phase, the input sample’s class is designated based on the class of the found BMU.

2.2 Random Forest

In the RF algorithm a set of single trees is built. The process of constructing one tree can be described in the following steps:

all N training samples. 2. Determine a decision at node using only m attributes, where m is smaller

than M. The split is selected based on maximal Information Gain. 3. Move the data through the node with respect to decision from the step 2. 4. Repeat steps 2, 3 till full tree is grown.

At the end of tree constructing, the class label is assigned to each leaf based on a class of samples in it. In the testing phase, a new sample is pushed down through all trees. From each tree a class label is remembered based on class of the reached leaf. The final response is the mode of votes from all trees. The proximity matrix Prox, with size NxN, can be easily obtained by putting all samples down the all trees. If two samples i and j are in the same terminal node in the tree, their proximity is increased by one Prox(i, j) = Prox(i, j) + 1. After the presentation of all samples the proximities are divided by the number of trees in the RF. The greater proximity value is, the more similar samples are. The dissimilarity measure can be formulated as ).

2.3 Self Organising Maps learned with Random Forest (RF-SOM)

In the proposed approach we assume that the RF is already learned. The learning of the network in one epoch can be summarized in the following steps:

1. Build a data set H as a union of all network’s weights W and attribute vector of j-th sample, . The matrix W size is LxM, where the L is a total number of neurons in the network. In this matrix each row contains weights from one neuron, the mapping from neurons’s 2D grid to matrix W is assumed.

2. For set H compute dissimilarity matrix using the RF. The size is (L + 1)x(L + 1).

3. Find the smallest distance to the neurons in dissimilarity matrix , in distances corresponding to j-th sample:

where v is an index of BMU in the matrix W, which can be mapped into r, v indices in the network 2D grid. 4. Update the network weigths with formula 2. 5. Repeat steps 1-4 for all samples in the training set.

After the end of the SOM learning, it is labeled as described in Section 2.1. In the testing phase, for input sample the BMU search is performed by taking the steps 1-3. The output class label is the same as the BMU class. We will denote the proposed method as RF-SOM. The computational complexity of using the Euclidean distance in the SOM is ), because we need to compute for each sample, from N samples, a distance between sample and L neurons. Whereas, the using of the RF proximity matrix in the proposed RF-SOM has complexity , where is a number of nodes in the tree. In the RF-SOM for each sample we propagate L + 1 input vectors through T trees in the RF, and passage through a tree has complexity )). The complexity of distance computation using the RF proximity matrix is worse than using Euclidean distance. Although, it is beneficial when compared to the memory complexity of the MDS used for RF proximity visualization, which uses ) memory. The RF-SOM requires only ). This discards the MDS as a method of the RF proximity matrix visualization for large data sets.

3 Results

We used 6 real data sets to examine properties of the RF-SOM. There were used data sets: ’Glass’, ’Wine’, ’Iris’, ’Sonar’, ’Ionosphere’, ’Pima’ from the ’UCI Machine Learning Repository’ [1]. In the Table 1 are presented data sets properties. In all experiments we used following parameters values for each SOM type: 0345, 008. We will denote a network learned with Eculidean distance as SOM. The network sizes for the SOM and the RF-SOM for each data set are equal, they are presented in the Table 1. The network sizes were chosen arbitrarily because selecting optimal network size is not in the scope of this paper. In all cases the SOM and the RF-SOM starts learning from the same initial weights values. The RF was constructed with 100 trees and for all data sets.

Table 1: The description of data sets used in experiments and network size used for each data set.

To present visulization properties of the proposed method we used ’Pima’ data set. In the Fig.1 there are presented: the SOM (Fig.1a) and the MDS (Fig.1c) both learned with Euclidean distance; and RF-SOM (Fig.1b) and RFMDS (Fig.1d) constructed using the RF proximity matrix. The SOM networks were presented as a 2D grid of neurons, where for each neuron, its weigths are presented by a polar area diagram (sometimes called coxcomb plot). In the MDS and the RF-MDS plots information about points distribution in reduced 2D space is available. However, information about how point’s position is affected by combination of attributes values is missing. What is more, there is hard to find crisp border to distinguish two classes on the MDS neither on the RF-MDS. In contrary to MDS technique, the SOM and RF-SOM plots do not provide information about explicit point distribution but rather a mapping of attributes combination onto 2D grid of neurons. After network labeling the class labels are assigned to neurons, therefore distinguishing specific combination of attributes in each class is possible. As expected, although, the SOM and the RF-SOM started learning from the same initial weights values they have different final distribution of attributes combinations across the network.

To compare the accuracy of the SOM learned with Euclidean distance and RF-SOM learned with RF proximity matrix we use information about samples class label. We measure the accuracy of classification. The results of comparison are presented in the Table 2. All of the results are mean over 10-fold cross validation. In the Table 2 we also include the accuracy of the alone RF as the reference. The RF-SOM on 4 data sets (’Glass’, ’Sonar’, ’Ionosphere’, ’Pima’) obtained better results than the SOM. The greatest improvement over the SOM was achived on ’Sonar’ set, it was 12.57%. The same performance of the SOM

Fig. 1: The visualizations of ’Pima’ data set with (a) the SOM, (b) the RF-SOM, (c) the MDS with Euclidean distance, (d) the RF-MDS. The class information is coded in red color for the first class and blue color for the second class in all plots. In the (a) and (b) for polar area diagram the attribute’s number corresponding to the neuron’s weight is shown in the legend.

and the RF-SOM was on ’Wine’ and ’Iris’, when on the latter the RF-SOM has higher standard deviation value. It is worth noting, that the improvement of the RF-SOM over the SOM depends on the accuracy of the RF. On data sets (like ’Sonar’ or ’Glass’) where the accuracy difference between the RF and the SOM is high, the RF-SOM noted large improvement over the SOM. Whereas, on data sets (’Wine’ and ’Iris’) with small accuracy difference between the RF and the SOM, the RF-SOM obtained the same mean accuracy as the SOM.

Table 2: The classification accuracy for RF, SOM and RF-SOM. The results are mean and std. over 10-fold cross validation.

We measure classfication accuracy for different number of trees in the RF to examine the influence of a number of trees in the RF to performance of the RF-SOM. The accuracy for the RF and the RF-SOM for a number of trees in the RF, T = {10, 20, 50, 100, 200, 500}, is presented in Fig.2. It can be observed that for all data sets except ’Sonar’ and ’Glass’ the accuracy of the RF does not depend on T. The good results of classification are obtained even for 10 trees in the RF. For ’Sonar’ and ’Glass’ sets the accuracy of the RF increases with increasing a number of trees. The very similar behaviour can be observed for the RF-SOM. The accuracy for ’Wine’, ’Iris’, ’Ionosphere’ and ’Pima’ slightly varies with T growing. However, for ’Sonar’ and ’Glass’ sets the RF-SOM obtains better results if more trees are used in the RF. The observed behaviour can be explained by more complex data relationships in ’Sonar’ and ’Glass’ sets which are better modeled with greater number of trees in the RF.

Fig. 2: The classification accuracy for (a) RF and (b) RF-SOM, for a different number of trees in the RF. The result are mean over 10-fold cross validation.

4 Conclusions

The novel method for visualizing the RF proximity matrix by the SOM was proposed. The RF-SOM method uses the RF to compute distances between input sample and neurons. The proposed method of visualization provide better understanding of relationship between data in the RF structure than the MDS. The RF-SOM contrary to the MDS provides a mapping of data onto 2D neurons grid. In case of new coming samples there is no need to recompute the whole RF proximity matrix like in the MDS method. Additionally, the proposed method has lower memory complexity than the MDS, which for large data sets is not applicable. What is more, the experimental results show that the RF-SOM learned with the RF dissimilarity gained better or the same accuracy than the SOM learned with Euclidean distance. As pointed in [16] the RF dissimilarity has attractive features: it can handle mixed variable types well, is invariant to monotonic transformations of the input variables and is robust to outliers. The attractiveness of RF dissimilarity and obtained results with RF-SOM encourage to focus our future work on using the RF proximity matrix in other clustering algorithms.

5 Acknowledgements

PP has been supported by the European Union in the framework of European Social Fund through the Warsaw University of Technology Development Programme.

References

1. Asuncion, A., Newman, D.J.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2007)

2. Breiman, L.: Random Forests, Machine Learning, vol.45, pp 5–32 (2001)

3. Czajkowski, M., Grze´s, M., Kretowski, M.: Multi-Test Decision Trees for Gene Expression Data Analysis, Lecture Notes in Computer Science, vol.7053, pp 154-167 (2012)

4. Genuer, R., Poggi, J.-M., Tuleau-Malot, C.:Variable selection using random forests, Pattern Recognition Letters, vol.31, pp 2225–2236 (2010)

5. Gray, K.R., Aljabar, P., Heckemann, R.A., Hammers, A., Rueckert, D.: Random Forest-Based Manifold Learning for Classification of Imaging Data in Dementia, Lecture Notes in Computer Science, vol.7009, pp 159-166 (2011)

6. Liaw, A., Wiener, M.: Classification and Regression by randomForest, R News, vol.2, pp 18-22 (2002)

7. K¨astner, M., Villmann, T.: Fuzzy Supervised Self-Organizing Map for Semisupervised Vector Quantization, Lecture Notes in Computer Science, vol.7267, pp 256-265 (2012)

8. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE, vol.78, pp 1464-1480 (1990)

9. Midenet, S., Grumbach, A.: Learning Associations by Self-Organization: The LASSO model, Neurocomputing, vol.6, pp 343-361 (1994)

10. Moosmann, F., Nowak, E., Jurie, F.: Randomized Clustering Forests for Image Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, pp 1632 - 1646 (2008)

11. Olszewski, D., Kacprzyk, J., Zadrozny, S.: Employing Self-Organizing Map for Fraud Detection, Lecture Notes in Computer Science, vol.7894, pp 150-161 (2013)

12. Olszewski, D.: An Experimental Study on Asymmetric Self-Organizing Map, Lecture Notes in Computer Science, vol.6936, pp 42-49 (2011)

13. Osowski, S., Linh, T.H. (2000). Fuzzy Clustering Neural Network for Classification of ECG Beats. In International Joint Conference on Neural Networks, 26-32.

14. P�lo´nski, P., Zaremba, K.: Self-Organising Maps for Classification with MetropolisHastings Algorithm for Supervision, Lecture Notes in Computer Science vol.7665, pp 149–156 (2012)

15. P�lo´nski, P., Zaremba, K.: Improving Performance of Self-Organising Maps with Distance Metric Learning Method. Lecture Notes in Computer Science, vol.7267, pp 169-177 (2012)

16. Shi, T., Horvath, S.: Unsupervised Learning with Random Forest Predictors, Journal of Computational and Graphical Statistics, vol.15, pp 118-138 (2006)

17. Shi, T., Seligson, D., Belldegrun, A.S., Palotie, A., Horvath, S.: Tumor classifica-tion by tissue microarray profiling: random forest clustering applied to renal cell carcinoma, Modern Pathology, vol.18, pp 547–557 (2005)

18. Szyma´nski, J., Duch, W.: Self Organizing Maps for Visualization of Categories, Lecture Notes in Computer Science, vol.7663, pp 160-167 (2012)

19. Szyma´nski, J.: Self–Organizing Map Representation for Clustering Wikipedia Search Results, Lecture Notes in Computer Science, vol.6592, pp 140-149 (2011)

20. Quinlan, J.R.: C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, (1993)

designed for accessibility and to further open science