High dimensional data can pose a significant challenge to learning methods due to the curse of dimensionality [1]. Feature selection is a prominent dimensionality reduction technique that selects a small subset of features based on certain relevancy criteria. Apart from reducing data dimensionality, feature selection provides insights into the data, prevents over-fitting and reduces computational costs for learning, which ultimately results in better learned models. Depending on whether there is label information available, feature selection can be classified into two categories: supervised and unsupervised. Supervised feature selection procedures are broadly classified into three groups, wrapper, filter and embedded methods [2]. Wrapper procedures select features for a specific learning model. Filter methods on the other hand are classifier agnostic. Feature selection and model learning are treated as two separate steps. These procedures rely on statistical characteristics of the data such as correlation, distance and information, to select the most important features. Embedded procedures incorporate feature selection as part of the learning model, as seen in neural nets. We focus on the model-independent filter procedures for feature selection, because of their classifier independence, simplicity and computational efficiency [3]. Specifically, we consider Mutual Information (MI) based criteria for feature selection. MI is a probabilistic measure that captures the ‘correlation’ between random variables (see Figure (1)). Whereas standard correlation captures linear relationships between variables, MI can capture non-linear dependencies between variables [4]. Since our aim is to develop better classifier models using feature selection, we select the best subset of features that together have the highest MI with the class variable. Estimating MI between a subset of features requires the estimation
of high dimensional joint probability distributions, which in turn necessitates exponentially large amounts of data. We circumvent this hurdle by invoking the assumption of conditional independence between the features. This reduces the MI estimation problem to a Conditional Mutual Information (CMI) estimation problem, involving three features at a time. However, selecting the subset of features with the highest CMI turns out to be a very hard combinatorial problem. We model feature selection as a Binary Quadratic Problem (BQP), which is NP-hard. We introduce approximations to solve the BQP taking inspiration from solutions to other related NP-hard problems like k-Sparse-PCA and Densest-k-Subgraph. We also evaluate our procedure by comparing it with less optimal, but computationally more efficient iterative procedures for feature selection. We evaluate our feature selection using classification accuracies across multiple datasets.
Motivation: Theoretical underpinnings of using CMI for feature selection have not been clearly outlined in literature. Also, current global feature selection methods like Semidefinite Programming (SDP) do not scale well. We therefore aim to provide better insights into the approximations and assumptions involved in CMI based feature selection, and introduce efficient approximations to solving the NP-hard problem of feature selection from related problems.
Contributions: We have demonstrated that the class posterior distribution can be approximated by selecting a subset of variables with the highest MI with the class variable. We have proved that the MI between a subset of variables and the class random variable, reduces to CMI between pairs of variables under the assumptions of conditional and class-conditional independence. We have modeled the NP-hard problem of feature selection as a Binary Quadratic Problem (BQP) and demonstrated our feature selection method across multiple datasets.
The rest of our paper is organized as follows. In Sec. II, we outline the problem of feature selection and formulate the BQP and its approximate solutions. Sec. III compares our work with existing feature selection procedures. We conclude with experiments in Sec. IV followed by discussion in Sec. V.
We consider the standard classification setting where we are presented with i.i.d data . Each data point
, is regarded as an instance of a set of n continuous random variables
. We interchangeably use the terms features or variables to denote X or its subsets. The dependent class label
, is considered to be an instance of a discrete random variable Y , that takes
Fig. 1: Venn diagram depicting entropy interaction. H(X) = {a,e,g,d}, H(Y)={b,e,g,f}, H(Z)={c,d,g,f}, I(X;Y)={e,g}, I(X;Z) = {d,g}, I(Y;Z) = {g,f}, H(X,Y,Z) = {a,b,c,d,e,f,g}
Information Theory Basics: Information of a random variable X is given by . Entropy H(X), characterizes the uncertainty about the random variable X. It is the expected information content for a random variable X with, H(X) = E[I(X)] =
. Mutual Information(MI) between two random variables X and Y , is a measure of information shared between them and is represented as I(X; Y ). It is symmetric with I(X; Y ) = I(Y ; X). In terms of entropy, MI is defined as
H(X|Y ), where H(X|Y ) is the conditional entropy. MI between random variables X and Y can also be understood as the reduction in entropy of X (or Y ) due to the presence of Y (or X). Conditional Mutual Information(CMI) denoted as I(X; Y |Z), is the expected MI of two random variables X and Y , given a third random variable Z.
values in [1, . . . , c]. We define S to be a subset of k feature indices, where and
, the subset of k features indexed by S. Likewise,
is the subset of left over indices, i.e. {1, . . ., n}\S and
is the subset of leftover
features indexed by
, i.e.
. We consider p(Y |X), to be the posterior probability on the dataset D. We now state our problem.
Problem 2.1: To estimate the optimal subset {1, . . . , n}, of feature indices, such that
is a good approximation to p(Y |X).
Theorem 2.1: , if and only if Mutual Information
When viewed as elements in a Bayesian Network, the random variables can be thought to form a Markov Chain where Y is conditionally independent of
given
. Therefore, the CMI
. With
, the necessary condition is proved. The statements in the proof are also true in the reverse order.
Theorem(2.1) provides the justification for choosing a MI based approach to determining the optimal depends only on D and is a constant that is shared between
and
. It is very likely that
is a non-zero value for every S except for the most trivial case with
. For a fixed value of k, we therefore try to estimate the optimal subset S that maximizes
.
A. The Binary Quadratic Problem
Given a subset of features , in order to evaluate
, we need to estimate the joint probability distribution
, which is often intractable. To simplify the estimation of the joint distribution, we introduce the assumption of conditional independence between features in the spirit of Na¨ıve Bayes and [5]. We first introduce a new term
as the subset S without the index i also denoted as S\{i}. We now outline the assumption that will simplify the joint distribution.
Assumption 2.1: For a set of selected features , the features
are conditionally independent and class-conditionally independent given
, i.e.
and
.
Theorem 2.2: If Assumption 2.1 is true, and Xthen, I
I
Proof: The MI between X
and Y is given by:
In the above derivation, we have first applied the MI Chain Rule I(A, B; C) = I(A; C) + I(B; C|A). We have then applied the MI rule and expressed the two trailing MI terms in terms of entropy. Applying Assumption 2.1 to (2), the high order entropy terms can be replaced with summations and reduced further using
, to yield,
Since we would like to select S that maximizes (3), we can formulate the global feature selection problem as,
This is equivalent to the constrained Binary Quadratic problem,
where, Q is a non-negative matrix with
and
and the non-zero indices of the solution x, constitute S.
B. Solving the Binary Quadratic Problem
We aim to solve the BQP, where Q is a symmetric and possibly indefinite matrix with non-negtiave elements, i.e., for all
. Although this problem is well de-fined, it is highly nonconvex due to the nonconvex constraint. And it is also known that this binary quadratic problem is NP-hard [6]. So, it is difficult to find the optimal value in practice. Therefore, we aim to find an approximate solution to the BQP. Approximation methods such as linear relaxation, spectral relaxation, semidefinite programming, truncated power method and low rank bilinear approximation have been applied to solve this family of NP-hard problems. We will now briefly review some of these relaxation methods.
1) Linear Relaxation: By introducing a new variable , we are able to linearize the quadratic term. With x =
, we formulate the following:
LP1 is equivalent to BQP, which is also NP-hard. LP1 is simplified by relaxing . One of the optimality conditions is then given by,
. This relaxation reduces LP1 to LP2 which is given by,
Since , the maximum value for Linear is equivalent to the k largest column (or row) sum of Q. The solution to Linear guarantees a tight lower bound to the BQP (see appendix). In our work, we use the solution from Linear to initialize the input to other algorithms.
2) Truncated Power Method: Truncated power (TPower) method aims to find the largest k-sparse eigenvector. Given a positive semidefinite matrix A, the largest k-sparse eigenvalue can be defined as follows [7]:
Matrix A is required to be positive semidefinte, but TPower method can be extended to deal with general symmetric matrices by setting where
such that
. The truncated power method is given as follows. Starting from an initial k-sparse vector
, at each iteration t, we multiply the vector
by A and then truncate the entries of
to zeros and set the largest k entries to 1. TPower can benefit from a good starting point. We use the solution from Linear as the initial sparse vector
.
3) Low Rank Bilinear Approximation: The low rank bilinear approximation procedure has been applied to solve the k-Sparse-PCA [8] and the Densest-k-Subgraph [9]. It approximates the solution to a BQP by applying a bilinear relaxation which is given as,
BQPprovides a good (2
-approximation [9]) approximation to BQP and can be solved in polynomial time using a d-rank approximation of Q. The authors in [9] have developed the Spannogram algorithm to estimate a candidate set (termed the Spannogram S ) of vector pairs (x, y) with k features. One of the vectors from the pair that maximizes BQP
, is the bilinear approximate solution to BQP.
4) Related BQP Approximation Methods: For the sake of completeness we briefly mention two other techniques that have been applied to approximate the BQP in the domain of feature selection. The authors in [10] propose two relaxation techniques, (1) Spectral relaxation and (2) Semidefinite Programming relaxation. In Spectral Relaxation, the constraint on the values of x are relaxed to continuous values. The values of x being positive, can be interpreted as feature weights. The solution to Spectral has been shown to be the largest eigenvector of Q [10]. Semidefinite Programming relaxation (SDP) is a closer approximation to the BQP than Spectral. The BQP is approximated by a trace maximization problem using semidefinite relaxation. The approximate solution x, to the BQP, is obtained from the SDP solution through random projection rounding based on Cholesky decomposition [11]. For our experiments, we apply random rounding with 100 projections. For further details, please refer to [10].
In this section we will discuss existing filter based Mutual Information measures for feature selection. We categorize these procedures as Greedy (iterative) Selectors and Global Selectors. We limit our discussion to the best feature selectors we found in our studies. For a broader perspective on feature selection procedures, we suggest the works of Tang et al. [12], Guyon and Elisseeff [3] and Duch [13].
A. Greedy Feature Selectors
Greedy selectors usually begin with an empty set S, and iteratively add to it the most important feature indices, until a fixed number of feature indices are selected or a stopping criterion is reached. MI between the random variables (features and label) provides the ranking for the features. The most basic form of the scoring function is Maximum Relevance (MaxRel) [14], where the score is simply the MI between the feature and the class variable. To account for the redundancy between features, Peng et al. [15], introduced the Maximum Relevance Minimum Redundancy (MRMR) criterion, which selects features with maximum relevance to the label and minimum redundancy between each other. A greedy procedure very closely related to our technique is the Joint Mutual Information (JMI), that was developed by Yang and Moody [16], and later by Meyer et al. [17]. In our experiments (Sec. IV), we evaluate the features selected by these iterative procedures across multiple datasets.
B. Global Feature Selectors
There is limited work on MI based global feature selection. In [4], Rodriguez-Lujan et al. introduced the Quadratic Programming Feature Selection (QPFS). This method can be viewed as a global alternative to MRMR. The second global technique proposed by Nguyen et al. [10], is related to our method presented in (BQP). Nguyen et al. [10], model a global feature selection problem based on the CMI matrix Q, and apply Spectral and SDP methods to approximate the solution.
TABLE I: Time complexities for the Global approximate solutions for BQP in number of features n
Fig. 2: Average time in seconds for an algorithm to select k features from data containing n features in experiment (n, k).
We consider two factors to evaluate the feature selection algorithms discussed so far, viz., time complexity and clas-sification accuracies. For the global methods, since we are approximating the solution, we also consider the tightness of the approximation. We conducted our experiments using MATLAB on an Intel Core-i7 2.3 GHz processor with 16GB of memory.
A. Feature Selectors: a test of scalability
Greedy feature selectors have time complexities of the order O(nk), which is negligible compared to the time complexities of the global feature selectors. Table I lists the time complexities for the global algorithms. To study time complexities, we conduct multiple experiments (n, k), where we simulate a CMI matrix Q by a random positive symmetric matrix of size , and select k features. The time complexity for experiment (n, k), is the average time of convergence over 10 runs. We use the same set of random matrices for each of the algorithms in the experiment. Figure (2) depicts the convergence times for different experiments. Linear algorithm is the most efficient, followed closely by Spectral and TPower methods. We used the CVX [18] implementation with SDPT3 solver [19], for all our SDP experiments. The SDP solver has a huge memory footprint and with matrix sizes
, we run into ‘Out of Memory’ errors. For the LowRank method, we used the following parameters,
for all our experiments. Please refer to [9] for more details on the LowRank.
B. BQP Methods: a test of approximation
For the next set of experiments, we evaluated the degree of approximation for the global algorithms. Since we do not know the optimal solution for the BQP, we compared the methods on their relative objective values. We estimated the binary feature vector x after applying each of the methods and then evaluated the objective value . We evaluated the percentage difference of every algorithm’s objective value with the Linear method’s objective. Similar to the experimental evaluation used for time complexity, we generated random data for each experiment (n, k) and averaged the values over 10 random
Fig. 3: The average percentage difference of the BQP objective values compared with the Linear BQP objective value. In experiment (n, k), n is the matrix dimension and k is the number of features selected.
runs. Figure (3) presents the results of the experiment. TPower and LowRank displayed the largest percentage increase from the Linear. Since TPower and LowRank approximate the BQP better than other methods, they must therefore be better feature selectors compared to Linear, Spectral and SDP. The TPower is also very efficient in terms of execution time and would be the ideal feature selector when considering both speed and accuracy.
C. Feature Selectors: a test of classification error
In this section we compare the TPower and the LowRank with other algorithms in terms of classification accuracies. For our experiments, we chose 13 publicly available datasets that are widely used to study MI based feature selection as in [4], [5], [10], [15]. The details of the datasets are captured in Table II. We performed feature selection for a set of k values and estimated the classifier performance across all values of k. Starting at k = 10, we incremented in steps of 1 till n or 100, whichever is smaller. We evaluated the classifier performance using Leave-One-Out cross validation (if ) or 10-fold cross validation and obtained the cross validation errors(%) for each fold. Since the average error across all values of k is not a good measure of the classifier performance, we applied the paired t-test, as also mentioned in [4], [10], [22], across the cross validation folds. For a fixed dataset and a fixed value of k, to compare TPower with say, MaxRel, we applied the one sided paired t-test at 5% significance over the error(%) of the cross validation folds for the two algorithms. We set the performance of TPower vs. MaxRel to win = w, tie = t
TABLE II: Datasets details: n is number of features, m is number of samples, c is number of categories, Error: is average cross validation error (%) using all features.
and loss = l, based on the largest number of t-test decisions across all the k values. Along the lines of earlier studies in feature selection, we used linear SVM as the classifier. To estimate the CMI we need to discretize the features. We believe the role of discretization is not unduly critical as long as it is consistent across all the experiments. We discretized the features using the Class Attribute Interdependence Maximization (CAIM) algorithm developed by Kurgan and Cios et al. [23]. Feature selection was performed on discretized data but the classification (after feature selection) was performed on the original feature space. Using the above procedure, we compared the performance of TPower and LowRank with all the other algorithms. Tables III and IV display the results of the experiment. The values in Table III (likewise IV) correspond to the difference in the average of classification error(%) between TPower (likewise LowRank) and all the other algorithms. From the results in these tables, we find that TPower and LowRank do well on most of the datasets across all the algorithms. When compared against each other LowRank does better than TPower. For the sake of brevity we have not displayed the comparisons between other pairs of algorithms. The win/tie/loss numbers by themselves do not provide a complete picture of the comparison. The difference in the average error also needs to be taken into account to assess the performance. A large percentage of negative values in the columns and their magnitudes indicate the low error values in classification for TPower and LowRank. Figure (4) displays the average classification error(%) trends for varying values of k for 3 datasets. Figures (4a) and (4d) for the Colon dataset, suggest that the addition of more features does not necessarily reduce classification error. Classification error trends also help us cross validate the best value of k for a dataset. For a given dataset, the error trends between the global and greedy procedures follow a similar pattern. This perhaps indicates that nearly similar features are being selected using both types of methods. We also note that for huge datasets with large values of n, greedy methods may not be a bad choice for feature selection.
Feature selection is a NP-hard problem and newer methods to approximate the solution will help drive research in this area. We have demonstrated that current methods applying MI and CMI for feature selection, inherently assume conditional independence between features. The conditional independence assumption limits the number of features compared to only 2 or 3 features at a time. There is need to derive better measures to approximate the importance of a group of selected features. To estimate the probability distributions, we had to discretize the features. We have not studied the effect of discretization in our work. Progress along all of these fronts will provide directions to improve MI based feature selection. In conclusion, we can state that both TPower and LowRank perform better than existing global and iterative techniques across most of the datasets. While LowRank slightly outperforms TPower, it does not compare well with regards to time.
This material is based upon work supported by the National Science Foundation (NSF) under Grant No:1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the NSF. We express our thanks to Nguyen X. Vinh [10], for providing his implementation for comparison.
[1] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, J. Friedman, and R. Tibshirani, The elements of statistical learning. Springer, 2009, vol. 2, no. 1.
[2] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” AI, vol. 97, no. 1, pp. 273–324, 1997.
[3] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” JMLR, vol. 3, pp. 1157–1182, 2003.
[4] I. Rodriguez-Lujan, R. Huerta, C. Elkan, and C. S. Cruz, “Quadratic programming feature selection,” JMLR, vol. 11, pp. 1491–1516, 2010.
[5] G. Brown, A. Pocock, M.-J. Zhao, and M. Luj´an, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” JMLR, vol. 13, no. 1, pp. 27–66, 2012.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability; A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.
[7] X.-T. Yuan and T. Zhang, “Truncated power method for sparse eigenvalue problems,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 899–925, 2013.
[8] D. S. Papailiopoulos, A. G. Dimakis, and S. Korokythakis, “Sparse pca through low-rank approximations,” arXiv:1303.0551, 2013.
[9] D. Papailiopoulos, I. Mitliagkas, A. Dimakis, and C. Caramanis, “Finding dense subgraphs via low-rank bilinear optimization,” in (ICML-14), 2014, pp. 1890–1898.
[10] X. V. Nguyen, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection,” in 20th ACM SIGKDD. ACM, 2014, pp. 512–521.
[11] M. X. Goemans and D. P. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming,” Journal of the ACM (JACM), vol. 42, no. 6, pp. 1115– 1145, 1995.
[12] J. Tang, S. Alelyani, and H. Liu, “Feature selection for classification: A review,” Data Classification: Algorithms and Applications, p. 37, 2014.
[13] W. Duch, “Filter methods,” in Feature Extraction. Springer, 2006, pp. 89–117.
[14] D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Proc. of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217.
[15] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and minredundancy,” IEEE PAMI, vol. 27, no. 8, pp. 1226–1238, 2005.
[16] H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for non-gaussian data,” NIPS, vol. 12, 1999.
[17] P. E. Meyer, C. Schretter, and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity,” Selected Topics in Signal Processing, IEEE Journal of, vol. 2, no. 3, pp. 261–274, 2008.
[18] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.
[19] K.-C. Toh, M. J. Todd, and R. H. T¨ut¨unc¨u, “Sdpt3 matlab software package for semidefinite programming, version 1.3,” Optimization methods and software, vol. 11, no. 1-4, pp. 545–581, 1999.
[20] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[21] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” Journal of bioinformatics and computational biology, vol. 3, no. 02, pp. 185–205, 2005.
[22] G. Herman, B. Zhang, Y. Wang, G. Ye, and F. Chen, “Mutual information-based method for selecting informative feature sets,” Pattern Recognition, vol. 46, no. 12, pp. 3315–3327, 2013.
[23] L. A. Kurgan and K. J. Cios, “Caim discretization algorithm,” Knowledge and Data Engineering, IEEE Transactions on, vol. 16, no. 2, pp. 145–153, 2004.
TABLE III: Comparison of TPower with other algorithms. The table values measure the difference in average classification accuracies of TPower with other algorithms. w, t and l indicate one-sided paired t-test results. The last row displays the total number of Wins(W ), Ties(T ) and Loss(L). N/A indicates comparison data was unavailable for large datasets using SDP.
Fig. 4: Average cross validation error(%) vs. Number of features. First Row: Comparison of Greedy methods with TPower and LowRank for 3 datasets. Second Row: Comparison of Global methods across 3 datasets.
We would like to study the goodness of approximation to BQP provided by the solution to LP2. We define an equivalent problem in terms of BQP.
Proof: is the sum of all elements in Q. Since
0, BQP
. Therefore,
. Since
is a constant for a matrix, under the same set of constraints,
We define some new quantities for the derivation of the bound. Let be the solution of BQP. Let
be the solution of LP2. Since
, we can expand
in terms of any binary vector x. Specifically we define
in terms of
,
Lemma A.1:
Let denote the maximum value of BQP and let
denote maximum value of LP2. If
is the solution of BQP and
is the solution of LP2. We have the following result:
Lemma A.2:
We are now ready to state the bound for LP2.
Theorem A.1:
The last statement (16) is arrived at by adding on both sides. Since
implies that
is a lower bound for
. We are therefore guaranteed a lower bound for QBP by solving LP2 and (16) provides the tightness of the bound.