b

DiscoverSearch
About
My stuff
High-Dimensional Feature Selection for Genomic Datasets
2020·arXiv
Abstract
Abstract

A central problem in machine learning and pattern recognition is the process of recognizing the most important features. In this paper, we provide a new feature selection method (DRPT) that consists of first removing the irrelevant features and then detecting correlations between the remaining features. Let D = [A | b] be a dataset, where b is the class label and A is a matrix whose columns are the features. We solve Ax = b using the least squares method and the pseudo-inverse of A. Each component of x can be viewed as an assigned weight to the corresponding column (feature). We define a threshold based on the local maxima of x and remove those features whose weights are smaller than the threshold. To detect the correlations in the reduced matrix, which we still call A, we consider a perturbation ˜A of A. We prove that correlations are encoded in ∆x =| x − ˜x |, where ˜x is the least quares solution of ˜A˜x = b. We cluster features first based on ∆x and then using the entropy of features. Finally, a feature is selected from each sub-cluster based on its weight and entropy. The effectiveness of DRPT has been verified by performing a series of comparisons with seven state-of-the-art feature selection methods over ten genetic datasets ranging up from 9,117 to 267,604 features. The results show that, over all, the performance of DRPT is favorable in several aspects compared to each feature selection algorithm.

Keywords: feature selection, dimensionality reduction, perturbation theory,

singular value decomposition, disease diagnoses, classification

Supervised learning is a central problem in machine learning and data mining [1]. In this process, a mathematical/statistical model is trained and generated based on a pre-defined number of instances (train data) and is tested against the remaining (test data). A subcategory of supervised learning is classification, where the model is trained to predict class labels [2]. For instance, in tumor datasets, class labels can be malignant or benign, the former being cancerous and the latter being non-cancerous tumors [3]. For each instance in a classification problem, there exists a set of features that contribute to the output [4].

In high-dimensional datasets, there are a large number of irrelevant features that have no correlation with the class labels. Irrelevant features act as noise in the data that not only increase the computational costs but, in some cases, divert the learning process toward weak model generation [5, 6]. The other important issue is the presence of correlation between good features, which makes some features redundant. Redundancy is known as multicollinearity in a broader context and it is known to create overfitting and bias in regression when a model is trained on data from one region and predicted on another region [7, 8].

The goal of feature selection (FS) methods is to select the most important and effective features [9]. As such, FS can decrease the model complexity in the training phase while retaining or improving the classification accuracy. Recent FS methods [10, 11, 12] usually find the most important features through a complex model which introduce a more complicated framework when followed by a classifier.

In this paper, we present a linear FS method called dimension reduction based on perturbation theory (DRPT). Let D = [A | b] be a dataset where b is the class label and A is an  m × nmatrix whose columns are features. We shall focus on datasets where m << n and of particular interests to us are genomic datasets where gene expression level of samples (cases and controls) are measured. So, each feature is the expression levels of a gene measured across all samples. Biologically speaking, there is only a limited number of genes that are associated to a disease and, as such, only expression levels of certain genes can differentiate between cases and controls. So, a majority of genes are considered irrelevant. One of the most common methods to filter out irrelevant features in genomic datasets is using p-values. That is, one can look at the expression levels of a gene in normal and disease cases and calculate the p-values based on some statistical tests. It has been customary to conclude that genes whose p-values are not significant are irrelevant and can be filtered out. However, genes expressions are not independent events (variables) and researchers have been warned against the misuse of statistical significance and p-values, as it is recently pointed out in [13, 14].

We consider the system Ax = b where the rows of A are independent of each other and Ax = b is an underdetermined linear system. This is the case for genomic datasets because each sample has different gene expressions from the others. Since Ax = b may not have a unique solution, instead we use the least squares method and the pseudo-inverse of A to find the solution with the smallest 2-norm. One can view each component  xiof x as an assigned weight to the column (feature)  Fiof A. Therefore, the bigger the  |xi|the more important Fiis in connection with b.

It then makes sense to filter out those features whose weights are very small compared to the average of local maximums over  |xi|’s. After removing irrelevant features, we obtain a reduced dataset, which we still denote it by [A | b]. In the next phase, we detect correlations between columns of A by perturbing A using a randomly generated matrix E of small norm. Let ˜x be the solution to (A+E)˜x = b. It follows from Theorem 3.3 that features  Fiand  Fjcorrelate if and only if  |xi − ˜xi|and  |xj − ˜xj|are almost the same. Next, we cluster ∆x = |x − ˜x|using a simplified least-squares method called Savitsky-Golay smoothing filter [15]. This process yields a step-wise function where each step is a cluster. We note that features in the same cluster do not necessarily correlate and so we further break up each cluster of ∆x into sub-clusters using entropy of features. Finally, from each sub-cluster, we pick a feature and rank all the selected features using entropy.

The “stability” of a feature selection algorithm is recently discussed in [16]. An algorithm is ‘unstable’ if a small change in data leads to large changes in the chosen feature subset. In real datasets, it is possible that there are small noise or error involved in the data. Also, the order of samples (rows) in a dataset should not matter; the same applies to the order of features (columns). In Theorem 3.4 we prove that DRPT is noise-robust and in Theorem 3.5 we prove that DRPT is stable with respect to permuting rows or columns.

We compare our method with seven state-of-the-art FS methods, namely mRMR [17], LARS [18], HSIC-Lasso [19], Fast-OSFS [20], group-SAOLA [21], CCM [22] and BCOA [23] over ten genomic datasets ranging up from 9,117 to 267,604 features. We use Support Vector Machines (SVM) and Random Forest (RF) to classify the datasets based on the selected features by FS algorithms. The results show that, over all, the classification accuracy of DRPT is favorable compared with each individual FS algorithm. Also, we report the running time, CPU time, and memory usage and demonstrate that DRPT does well compared to other FS methods.

The rest of this paper is organized as follows. In Section 2, we review related work. We present our approach and the algorithm in Section 3. Experimental results and performance comparisons are shown in Section 4 and we conclude the paper in Section 5.

FS methods are categorized as filter, wrapper and embedded methods [24]. Filter methods evaluate each feature regardless of the learning model. Wrapperbased methods select features by assessing the prediction power of each feature provided by a classifier. The quality of the selected subset using these methods is very high, but wrapper methods are computationally inefficient. The last group consolidates the advantages of both methods, where a given classifier selects the most important features simultaneously with the training phase. These methods are powerful, but the feature selection process cannot be defused from the classification process.

Providing the most informative and important features to a classifier would result in a better prediction power and higher accuracy [25]. Selecting an optimal subset of features is an NP-hard problem that has attracted many researchers to apply meta-heuristic and stochastic algorithms [26, 27, 28].

Several methods exist that aim to enhance classification accuracy by assigning a common discriminative feature set to local behavior of data in different regions of the feature space [29, 30]. For example, localized feature selection (LFS) is introduced by Armanfard et al. [30], in which a set of features is selected to accommodate a subset of samples. For an arbitrary query of the unseen sample, the similarity of the sample to the representative sample of each region is calculated, and the class label of the most similar region is assigned to the new sample.

On the other hand, some approaches use aggregated sample data to select and rank the features [17, 18, 19, 31, 32]. The least absolute shrinkage and selection operator (LASSO) is an estimation method in linear models which simultaneously applies variable selection by setting some coefficients to zero [31].

Least angle regression (LARS) proposed by Efron et al. [18] is based on LASSO and is a linear regression method which computes all least absolute shrinkage and selection operator [31] estimates and selects those features which are highly correlated to the already selected ones.

Chen et al. [32] introduced a semi-supervised FS called Rescaled Linear Square Regression (RLSR), in which rescaling factors are incorporated to exploit the least square regression model and rank features. They solve the minimization problem shown in equation 1 to learn  Θand  YU, which are a matrix of rescaling factors and unknown labels, respectively.

image

where W is a sparse matrix that represents the importance of features, X is the dataset, Y is class labels, and b = 1n(YT 1 − WT X1), where n is number of samples in a dataset. Their proposed algorithm continuously updates W, b and  YUuntil convergence.

Yamada et al. [19] proposed a non-linear FS method for high-dimensional datasets called Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso), in which the most informative non-redundant features are selected using a set of kernel functions, where the solutions are found by solving a LASSO problem. The complexity of the original Hilbert-Schmidt FS (HSFS) is  O(n4). In a recent work [33] called Least Angle Nonlinear Distributed (LAND), the authors have improved the computational power of the HSIC-Lasso. They have demonstrated via some experiments that LAND and HSIC-Lasso attain similar classification accuracies and dimension reduction. However, LAND has the advantage that it can be deployed on parallel distributed computing. Another kernel-based feature selection method is introduced in [22] using measures of independence and minimizing the trace of the conditional covariance operator. It is motivated by selecting the features that maximally account for the dependence on the covariates’ response.

In some recent real-world applications, we need to deal with sequentially added dimensions in a feature space while the number of data instances is fixed. Yu et al. [34] developed an open source Library of Online FS (LOFS) using state-of-the-art algorithms. The learning module of LOFS consists of two submodules, Learning Features added Individually (LFI) and Learning Grouped Features added sequentially (LGF). The LFI module includes various FS methods including Alpha-investing [35], OSFS [36], Fast-OSFS [20], and SAOLA [37] to learn features added individually over time, while the LGF module provides the group-SAOLA algorithm [21] to mine grouped features added sequentially.

In [38], the authors proposed a bio-inspired optimization algorithm called Coyote Optimization Algorithm (COA) simulating the behavior of coyotes. Using their new strategy, COA makes a balance between exploration and exploitation processes to solve continuous optimization problems. Very recently, the authors [23] upgraded their method by proposing a binary version of COA called Binary Coyote Optimization Algorithm (BCOA), which is a wrapper feature selection method.

There is a great interest in the applications of FS methods in disease diagnoses and prognoses. For example, Parkinson’s disease (PD) is a critical neurological disorder and its diagnoses in its initial stages is extremely complex and time consuming. Recently, voice recordings and handwritten drawings of PD patients are used to extract a subset of important features using FS algorithms to diagnose PD [39, 40, 41, 42, 43] with a good success rate.

Consider a dataset D consisting of m samples where each sample contains n + 1 features. Let us denote by A the first n columns of D and by b the last column. We also denote by  Fithe i-th feature (column) of A. We shall first consider eliminating the irrelevant features. Throughout, by the norm of a

vector, we always mean its 2-norm. Recall that

image

Denote by  σ1 ≥ σ2 ≥ · · · ≥ σrthe singular values of A, where r = min(m, n). The smallest non-zero singular value of A is denoted by  σminand the greatest of the  σiis also denoted by  σmax. It is well-known that  ||A||2= σmax. Recall that A admits a singular value decomposition (SVD) in the form  A = UΣV T, where U and V are orthogonal matrices and Σ = diag(σ1, . . . , σr, 0, . . . ,0) is an m × ndiagonal matrix. Here,  V Tdenotes the transpose of V .

We first normalize the columns of A so that each  Fihas norm 1. Then, we solve the linear system Ax = b by using the method of least squares (see theorem 3.1). Here, x = [x1 · · · xi · · · xn]T. The idea is to select a small number of columns of A that can be used to approximate b. Since Ax = b may not have a unique solution, instead we consider a broader picture by solving the least squares problem minx||Ax − b||2whose solution is given in terms of pseudo-inverse and SVD of A. The following result is well-known, see [44].

Theorem 3.1 (All Least Squares Solutions).  Let A be an m × n matrix

image

A+b + z, where  z ∈ker(A). Furthermore, the unique solution whose 2-norm is the smallest is given by  x = A+b.

In other words, we can approximate the label column b as a linear combination  x1F1 + · · · + xnFn. So each  |xi|can be viewed as an assigned weight to Fi. Given that each  Fihas norm 1, if  |xi|is small compared to others, then the vector  xiFiwill have a negligible effect on b. It then makes sense to filter out those features whose weights are very small. In other words, we shrink the weights of irrelevant features to zero.

In this paper, we shall mostly focus on datasets where m << n. Of special interest to us are genomic datasets where there are usually tens or hundreds of samples compared to tens of thousands of genes. The matrix A in these datasets has full row-rank because gene expressions of different samples are independent of each other. Since  b ∈ Rm, it is enough to identify only m independent columns of A. Intuitively, it makes sense to eliminate the columns that are less important.

We prove below how weight of each feature  Fiis directly affected by the relevance of  Fito b. Suppose that A is an  m × nof full row-rank and consider

the SVD of A as  A = UΣV T. Let  U =� u1

form an orthonormal basis of  Rm.

Theorem 3.2. Let A be a full row-rank matrix and denote by x = [x1 . . . xn]Tthe least squares solutions to Ax = b. Then, each component  xiof x is given

image

Proof. Since A is full row-rank, the right inverse of A is  A+=  AT(AAT)−1. Consider the SVD of A as  A = UΣV T. Then  AAT=  UΣΣT U T= �mi=1 σ2i uiuTi. Note that the solution of Ax = b with the smallest norm is  x = A+b= AT(AAT)−1b. Let z = (AAT)−1b. So,  b = AAT z= �mi=1 σ2i uiuTi z. Since the

uis are orthonormal, we get

image

Since A has full row-rank, we have  σk >0, for all k. Let ¯b1, . . . ,¯bmbe the coordinates of b with respect to the basis  u1, . . . , umof  Rm. Similarly, let ¯z1, . . . , ¯zmbe the coordinates of z with respect to this basis. So, b = U[¯b1 . . .¯bm]Tand  z = U[¯z1 . . . ¯zm]T. Equation (2) can be written as ¯bk = σ2k¯uk, for each 1  ≤ k ≤ m. On the other hand,  x = A+b = AT(AAT)−1b = AT z. Since z = U[¯z1 . . . ¯zm]T, we deduce that  xi = ⟨Fi, z⟩, for each  i. □

We note that the extent to which a feature is relevant to b also depends on how important the other features are in determining b. This fact is reflected in Theorem 3.2 by taking into account the singular values of A that encode part of the information about A. Also, the definition of relevancy is not quantitative and one has to set a threshold for the degree of relevancy. We set a dynamic threshold by calculating the average of all local maxima in x and remove those features that their corresponding value  |xi|is smaller than the threshold. In a sense, the threshold is set so that rank of the reduced matrix is still the same as of the original A. So, in the reduced matrix, we only keep features that have a higher impact on b and yet the reduced matrix retains the same prediction power as A in approximating b.

If A is full row-rank then, it follows from Theorem 3.1 that the solution  xRof smallest 2-norm to the system Ax = b is in the row space of A. So, there is a vector y such that  xR=  −AT y. Hence,  xRsatisfies  xR+  AT y= 0, AxR= b.

In other words,  xRis part of the solution to the non-singular linear system  I ATA0

Next, we show how we can detect correlations between features. Recall that a perturbation of A is of the form A + E, where E is a random matrix with the normal distribution. We choose E to be a random matrix such that ||E||2≈10−sσmin(A), for some  s ≥0. We set s = 3 where our estimates are correct up to a magnitude of 10−3.

Theorem 3.3. Let x and  ˜xbe solutions of Ax = b and (A + E)˜x = b, where E is a perturbation such that such that  ||E||2= 10−sσmin(A). If a feature  Fiis independent of the rest of the features, then  |xi − ˜xi|≈0. Furthermore, suppose that features  Fjand  Fkcorrelate, say  Fj = cFkfor some scalar c. If  Fjand Fkare independent from the rest of the features, then c = xk−˜xkxj−˜xj.

Proof. From Ax = b and (A + E)˜x = b, we get  A(x − ˜x) =  E˜x. Consider the SVD of A + E which is of the form  A + E = UΣV T. So,  ˜x = V Σ−1U T b. Since U and V are orthogonal and for orthogonal matrices we have  ||Uv||2= ||v||2,

we get

image

Hence,  ||E˜x||2≤ ||E||||˜x||2≤10−sand we deduce that

image

Now, if a feature, say  Fi, is independent of the rest of features, then it follows that  |xi − ˜xi|≈0. Furthermore, since  Fjand  Fkare independent from the rest

of the features, we must have

image

Theorem 3.3 shows how we can filter out irrelevant features by looking at the components of  x − ˜xthat are close to zero. Also, as we mentioned before, we normalize the columns of A so that each  Fihas norm 1. So, if in the raw dataset  Fjand  Fkcorrelate then after normalization we must have  F′j=  ±F′k. Here,  F′j= Fj||Fj||. So, by Theorem 3.3, if  Fjand  Fkare independent from other features, we must have  |xk−˜xk|= |xj −˜xj|. We explain these notions further in a synthetic dataset. Consider a synthetic dataset with 22 features and 100 samples and the label column b which we set as b = 3F19+ 5F17+ 2F20. The first 20 features of this dataset are generated randomly in the interval of -1 and 1. The correlations between remaining features are set as follows:  F21= 2F18+ 4F19and  F22= 3F20. First, we normalize A. Then solve Ax = b and (A + E)˜x = band calculate ∆x as shown in Table 1.

Table 1: Perturbation of the synthetic Dataset

image

Let ∆x = |x − ˜x|and denote its i-th component with ∆xi. Since each of F1, . . . , F17are independent from the other features, as we expected, we have ∆xi ≈0, for all 1  ≤ i ≤17. However,  F17is relevant because it correlates with b. Indeed,  F17is a very important feature because we cannot make up for its loss using other features. Hence, we should be able to distinguish and preserve F17from other  Fifor which ∆xi= 0. This can be accomplished by noting that irrelevant features have smaller  |xi|compared to other features as can be seen from Table 1.

Since  F22= 3F20, these features correlate and are independent from other features. By normalization, we get  F′22=  F′20. So, we expect to have ∆x20 ≈x22as shown in Table 1. We deduce that  F22and  F20are dependent and so we should only choose one of them.

Finally, after normalization and rewriting the relation  F21= 2F18+ 4F19, we obtain  F′21 ≈ F′19modulo  F′18. The reason for this is because norms of  F′21and  F′19outweigh norm of  F′18. This is confirmed in Tabel 1 as ∆x19and ∆x21are closest to each other compared to the others. So,  F19and  F21fall into the same cluster of ∆x. This means that one of  F19or  F21must be removed as a redundant feature. We shall now explain the clustering process of ∆x.

By Theorem 3.3, if features  Fiand  Fjcorrelate, then the differences ∆xiand ∆xjare almost the same. That is, the correlations between features are encoded in ∆x. Now, we sort ∆x and obtain a stepwise function where each step can be viewed as a cluster consisting of features that possibly correlate with each other. To find an optimal number of steps, it makes sense to smooth ∆x where we view ∆x as a signal and use a simplified least-squares method called Savitsky-Golay smoothing filter [15]. Figure 1 exhibits how the smoothening process on ∆x preserves its whole structure without changing the trend.

We note that the converse of Theorem 3.3 may not be true in general. That is ∆xiand ∆xjbeing the same does not necessarily imply that  Fiand  Fjcorrelate. Hence, in the next step, we want to further break up each cluster of ∆x into sub-clusters. There are several ways to accomplish this step and one of the most natural ways is to use entropy of features.

Generally, entropy is a key measure for information gain and it is capable of quantifying the disorder or uncertainty of random variables. Also, entropy effectively scales the amount of information that is carried by random variables. Entropy of a feature F is defined as follows:

image

where m is the number of samples and  fkis the frequency with which F assumes the k-th value in the observations.

Figure 2(a) shows clustering the set of all features based on ∆x, and then a typical cluster splits into sub-clusters using entropy as shown in Figure 2(b).

image

Figure 1: ∆x vs. smoothed ∆x

To do so, we sort the features of a cluster based on their entropy which yields another step-wise function. At this stage, we pick one candidate feature from each sub-cluster based on the corresponding values  |xi|. Finally, the selected features are ranked based on both their entropies and the corresponding  |xi|’s. The final sorting of the features is an amalgamation of these two rankings.

3.1. Noise-robustness and stability

In real datasets, it is likely that D involves some noise. For example, in genomics, it is conceivable that through the process of preparing a genomic dataset, some error/noise is included and as such the dataset D is noisy. We note that the label column b is already known to us (and without noise). So instead of D = [A | b] we deal with D = [A1 | b], where  A1 = A + E1and ||E1||2is small (||E1||2= 10−sσmin(A)). A perturbation of  A1is of the form ˜A1 = A1 + E2, where  ||E2||2= 10−sσmin(A). Our aim is to show that if certain columns of A correlate, then so do the same columns of  A1and vice versa.

Theorem 3.4. Let ˜x, ˜ybe solutions of  A1˜x = band ˜A1˜y = b, respectively. Suppose that  S′=  {F1, . . . , Ft}is set of columns of A such that �ti=1 ciFi= 0,

image

1. any subset of  S′is linearly independent,

2.  F1, . . . , Ftare linearly independent from the remaining columns of A.

image

image

Figure 2: (a) Sorted smoothed ∆x (b) Sorted entropies of the magnified cluster

Proof. From (A + E1)˜x = band (A + E1 + E2)˜y = b, we get  A(˜x − ˜y) = −E1˜x1+ (E1 + E2)˜y. Similar arguments as in the proof of Theorem 3.3 can be

used to show that

image

We deduce that

image

Since  F1, . . . , Ftare linearly independent from the rest of features in S, we

get

image

Now, if�c1 · · · ct�and�˜x1 − ˜y1 · · · ˜xt − ˜yt�are not proportional, we can use Equation (5) and our first hypothesis to get a dependence relation of a shorter length between the elements of  S′, which would contradict our assumption that any proper subset of  S′is linearly independent. The proof is complete.

image

We also remark that our method is insensitive to shuffling of the dataset D. That is, if we exchange rows (or columns), there is an insignificant change in ∆x. We have demonstrated this fact through experiments in Tables 4; we offer a proof as follows.

image

Proof. We show this for permutation of rows and a similar argument can be made for permuting columns. Suppose that  D1= [A1 | b1] is obtained from D by shuffling rows. First, assume that only two rows are exchanged. Then there exists an elementary matrix T such that  T A = A1and  T b = b1. Since T is invertible, it follows that Ax = b if and only if  A1x = b1. For the general case, we note that every shuffling is a composition of elementary matrices.

3.2. Algorithm

The Flowchart and algorithm of DRPT are as follows. The MATLAB®implementation of DRPT is publicly available in GitHub 1.

image

Figure 3: Flowchart of Dimension Reduction based on Perturbation Theory (DRPT)

image

3.3. Complexity

The complexity of our proposed method is dominated by the complexity of the SVD which is  O(mn2, m2n), since the inverse of perturbed ˜A is calculated using SVD.

We compared our method with seven state-of-the-art FS methods, namely minimal-redundancy-maximal-relevance criterion (mRMR) [17], least angle regression (LARS) [18], Hilbert-Schmidt Independence Criterion Lasso (HSICLasso) [19], Fast Online Streaming FS (Fast-OSFS) and Scalable, Accurate Online FS (group-SAOLA) [34], Conditional Covariance Minimization (CCM) [22] and Binary Coyote Optimization Algorithm (BCOA) [23] . We used MATLAB ®implementations of LARS and LASSO by Sj¨ostrand [45], HSIC-Lasso and BCOA by their authors, Fast-OSFS and group-SAOLA given in the open source library [34]. The CCM method is implemented in Python by their authors and its code available at GitHub2.

To have a fair comparison among different FS methods, we read the datasets by the same function and use a stratified partitioning of the dataset so that 70% of each class is selected for FS. Then we use SVM and RF classifiers implemented in MATLAB®, to evaluate the selected subsets of features on the remaining 30% of the dataset. We have used linear kernel in SVM (default setting of SVM in MATLAB®) and as for RF, we set 30 as the number of trees and the other parameters have default values.

4.1. Datasets

We select a variety of dataset from Gene Expression Omnibus (GEO) 3and dbGaP 4to perform FS and classification. The specifications of all datasets are given in Table 2.

Table 2: Dataset Specifications

image

All the GEO datasets are publicly available. To pre-process the data, we develop an R code to clean and convert any NCBI dataset to CSV format 5. We use GEO2R [46] to retrieve the mappings between prob IDs and gene samples. Probe IDs without a gene mapping were removed. Expression values of each gene are the average of expression values of all mapped prob IDs to that gene. We also handle missing values with k-Nearest Neighbors (kNN) imputation method.

The dataset NeuroX holds SNP information about subjects’ Parkinson disease status and sociodemographic (e.g., onset age/gender) data. Parkinson’s disease status coded as 0 (control) and 1 (case), from clinic visit using modified UK Brain Bank Criteria for diagnosis. The original NeuroX has 11402 samples, and it is only accessible by authorized access via dbGaP. It has 535202 features that each two sequence features are considered as a SNP. So after cleaning and merging features of NeuroX, we use two subsets of 100 and 200 samples with 267601 SNPs (NX100 and NX200) for this paper.

4.2. Parameters

A FS method that selects most relevant and non-redundant features (Minimum Redundancy and Maximum Relevancy) is favorable in the sense that the top k selected features retain most of the information about the dataset. On the other hand, the top selected genes in a genomic dataset must be further analyzed in wet labs to confirm the biological relevance of the genes to the disease. For example, authors in [47] first identified 50 top genes of a Colon cancer dataset using their FS method. Then, they selected the first 15 genes, because adding more genes would not result in significant changes to the prediction accuracy. Similar studies [48, 49] suggest considering the top 50 features. So, in Table 3, we set k = 50 to select a subset of 50 features using FS algorithms. Then, for t = 1 to k, we feed the first t features to the classifier to find an optimal t so that the subset of first t features yields the highest accuracy. This set up is applied across all FS methods. In Figure 4, we expand this idea by considering up to 90 features using each FS method. We can see that there is small, incremental changes in classification accuracies when we increase the number of features from 50 to 90.

We report the average classification accuracies and average number of selected features over 10 independent runs where the dataset is row shuffled in each run. We note the top k selected features using a FS algorithm might differ over different runs because the dataset is row shuffled and so the training set changes on every run. Also, optimal subset size for SVM and RF might be different, in other words SVM might attain the maximum accuracy using the first 20 features while RF might attain its maximum using the first 25 features.

Both Fast-OSFS and group-SAOLA have a parameter  α, which is a threshold on the significance level. The authors of the LOFS [34] in their Matlab user manual 6recommend setting  α= 0.05 or  α= 0.01. However, our experiments based on these parameters showed inferior classification accuracies compared to other methods across all datasets given in Table 2. We also note that the running times of both Fast-OSFS and group-SAOLA increased as we increased

image

Experimenting with various values of  α, we realized that increasing  αfrom 0.01 to 0.5 exhibited clear improvement in classification accuracies on all datasets except NeuroX. So, we set  α= 0.5 for all datasets except NeuroX.

We also experienced that both Fast-OSFS and group-SAOLA on NeuroX may not execute all the time when  α > 0.0005; often errors were generated as part of a statistics test function. So, for NeuroX, we set  α= 10−5for both Fast-OSFS and group-SAOLA.

The group-SAOLA model has an extra parameter for setting the number of selected groups, selectGroups. As there was no default or recommended value for this parameter, we obtained results by varying selectGroups from 2 to 10 for all the datasets, and we chose the highest accuracy for each dataset.

4.3. Hardware and Software

Our proposed method DRPT and other FS methods in section 4 have been run on an IBM®LSF 10.1.0.6 machine (Suite Edition: IBM Spectrum LSF Suite for HPC 10.2.0) with requested 8 nodes, 24 GB of RAM, and 10 GB swap memory using MATLAB®R2017a (9.2.0.556344) 64-bit(glnxa64). Since CCM is implemented in Python and uses TensorFlow[50], we requested 8 nodes, 120 GB of RAM, and 40 GB swap memory on the LFS machine using Python 3.6.

4.4. Results.

The average number of selected features and average classification accuracies over 10 independent runs using SVM and RF on the datasets described in Section 4.1 are shown in Table 3.

The empty spaces in Table 3 under LARS, HSIC-Lasso, CCM and BCOA’s columns simply mean that these methods do not run on those datasets; this is a major shortfall of these methods and it would be interesting to find out why and to what extent LARS, HSIC-Lasso and BCOA fail to run on a dataset.

Table 3: Superscript is average number of selected features and subscript is resulting classifi- cation accuracies (CA) based on SVM and RF using mRMR, LARS, HSIC-Lasso, Fast-OSFS, group-SAOLA, CCM, BCOA and DRPT over 10 runs.

image

Since the NeuroX datasets have 267,601 features, CCM method requires 1.5 TB of RAM to execute.

In terms of accuracy using either of SVM or RF, we can see from Table 3 that DRPT is at least as good as any of the other seven methods. We can further infer that, in general, SVM has a better performance than RF on these datasets and SVM requires more features than RF to attain the maximum possible accuracy.

In Table 4, we report the standard deviation (SD) of the number of selected features and SD of the classification accuracies over 10 independent runs. Lower SDs are clearly desirable, which is also an indication of the method’s stability with respect to permutation of rows.

Figure 4 shows the average classification accuracy results of our DRPT compared to other FS methods using k features and the SVM classifier, where k is between 10 and 90. When a FS method returns a subset of k features, we use SVM to find an optimal  t ≤ kso that the first t features yield the best accuracy. Note that we do not look for the best subset and rather add the features se-

Table 4: Superscript is SD of # selected features and subscript is the SD of resulting classifi- cation accuracies (CA) based on SVM and RF using mRMR, LARS, HSIC-Lasso, Fast-OSFS, group-SAOLA, CCM, BCOA and DRPT over 10 runs.

image

quentially to find the optimal t. Then we calculate the accuracy using the first t features and take the average of these accuracies over 10 runs. Our evaluation metric is consistent across all FS methods.

Figure 4 shows the general superiority of the classification accuracy of our proposed model over the other models for the 9 genomic datasets used in our study. We can see a steady increase in classification accuracies of different FS methods as we increase k from 10 to 50, however the curves usually flat out when k is between 50 and 90. Note that HSIC-Lasso, Fast-OSFS, and group-SAOLA models output a subset of fewer than 60 features.

We note that the default number of selected features by LARS is almost the number of samples in a dataset. In Table 5, we perform a further comparison between LARS and DRPT where we set k to be the default number of features suggested by LARS and we use the classifier to find an optimal subset of size at most k. For example, the dataset GDS1615, has 127 samples in total. Since we take approximately 70% of samples for FS, the suggested number of features by LARS is k = 87.

Table 5: Superscript is average number of selected features and subscript is resulting classifica- tion accuracies (CA) based on SVM and RF using LARS Suggestion (LS) for 10 independent runs of DRPT and LARS.

image

If we look at the performance of LARS just based on its default number of features, we note that CA of LARS significantly drops. This, in particular, suggests that LARS does not select an optimal subset of features.

We also take advantage of IBM®LSF to report running time, CPU time and memory usage of each FS method. We just remark that through parallelization, an algorithm might achieve a better running time at the cost of having greater CPU time.

Figure 5 depicts running time of FS methods that includes the classification time using SVM as well. We can see that LARS, HSIC-Lasso, and DRPT have comparable running times. The running times of Fast-OSFS, group-SAOLA

image

Figure 4: Classification accuracies (CA) based on SVM using mRMR, LARS, HSIC-Lasso, Fast-OSFS, group-SAOLA, CCM, BCOA and DRPT over 10 runs for different number of

image

image

Figure 5: Running Time of feature selection by DRPT, HSIC-Lasso, LARS, Fast-OSFS, group-

image

and BCOA are higher than DRPT while the running time of mRMR is the worst among all by order of magnitude.

Next, we compare CPU time. For a non-parallelized algorithm, the CPU time is almost the as same as the running time. However, a parallelized algorithm takes more CPU time as it hires multi-processes. Figure 6(a) shows the CPU time that is taken by FS methods on six common datasets. Clearly, mRMR takes the highest CPU time and it is also obvious that HSIC-Lasso uses more processes as it is implemented in parallel.

We also quantify the computational performance of all methods based on the peak memory usage over six common datasets (Figure 6(b)). We observe that mRMR and HSIC-Lasso require an order of magnitude higher memory than LARS. Although the peak memory usage by DRPT is significantly lower than mRMR and HSIC-Lasso, DRPT takes almost the same amount of memory across all datasets. In this regard, there is a potential for more efficient implementation of DRPT.

We had to leave the CCM method out of the comparison in Figures 5 and 6

image

Figure 6: (a) CPU Time and (b) Memory taken by DRPT, HSIC-Lasso , LARS, Fast-OSFS,

image

Table 6: Running time, CPU time and memory taken by CCM model

image

because it is implemented in Python and required a high volume of RAM while the other methods implemented in Matlab. Table 6 shows the CCM performance in terms of running time, CPU time, and memory usage, where running time and CPU time are measured by second and memory usage is scaled in GB.

In this paper, we presented a linear feature selection method (DRPT) for high-dimensional genomic datasets. The novelty of our method is to remove irrelevant features outright and then detect correlations on the reduced dataset using perturbation theory. While we showed DRPT precisely detects irrelevant and redundant features on a synthetic dataset, the extent to which DRPT is effective on real dataset was tested on ten genomic datasets. We demonstrated that DRPT performs well on these datasets compared to state-of-the-art feature selection algorithms. We proved that DRPT is robust against noise. Performance of DRPT is insensitive to permutation of rows or columns of the data. Even though the running time of DRPT is comparable to other FS methods, an efficient implementation of DRPT in Python or C++can help improve both memory usage and the running time.

In this paper, we focused only on genomic datasets because inherently they are similar. For example, they all have full-row rank. Besides, it is widely accepted that there is no dimension reduction algorithm that performs well on all datasets (compared to other methods). In a future work, we aim to revise our current algorithm to offer a new FS algorithm that performs well on face and text datasets.

The authors would like to thank the anonymous reviewers for valuable and practical comments that helped to improved the paper. The research is supported by NSERC of Canada under grant #RGPIN 418201.

[1] L. Breiman, Classification and regression trees, Routledge, 2017.

[2] N. M. Nasrabadi, Pattern recognition and machine learning, Journal of electronic imaging 16 (4) (2007) 049901.

[3] T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. Van De Rijn, S. S. Jeffrey, et al., Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications, Proceedings of the National Academy of Sciences 98 (19) (2001) 10869–10874.

[4] G. H. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selec- tion problem, in: Machine Learning Proceedings 1994, Elsevier, 1994, pp. 121–129.

[5] S. Khalid, T. Khalil, S. Nasreen, A survey of feature selection and feature extraction techniques in machine learning, in: Science and Information Conference (SAI), 2014, IEEE, 2014, pp. 372–378.

[6] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, Journal of bioinformatics and computational biology 3 (02) (2005) 185–205.

[7] C. F. Dormann, J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carr´e, J. R. G. Marqu´ez, B. Gruber, B. Lafourcade, P. J. Leit˜ao, T. M¨unkem¨uller, C. McClean, P. E. Osborne, B. Reineking, B. Schr¨oder, A. K. Skidmore, D. Zurell, S. Lautenbach, Collinearity: a review of methods to deal with it and a simulation study evaluating their performance, Ecography 36 (1) (2013) 27–46.

[8] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, T. Mat- sui, Best subset selection for eliminating multicollinearity, Journal of the Operations Research Society of Japan 60 (3) (2017) 321–336.

[9] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of machine learning research 3 (Mar) (2003) 1157–1182.

[10] M. Gaudioso, E. Gorgone, M. Labb´e, A. M. Rodr´ıguez-Ch´ıa, Lagrangian relaxation for svm feature selection, Computers & Operations Research 87 (2017) 137–145.

[11] Y. Xue, L. Zhang, B. Wang, Z. Zhang, F. Li, Nonlinear feature selection using gaussian kernel SVM-RFE for fault diagnosis, Applied Intelligence 48 (10) (2018) 3306–3331.

[12] P. Chaudhari, H. Agarwal, Improving feature selection using elite breeding qpso on gene data set for cancer classification, in: V. Bhateja, C. A.

Coello Coello, S. C. Satapathy, P. K. Pattnaik (Eds.), Intelligent Engineering Informatics, Springer Singapore, Singapore, 2018, pp. 209–219.

[13] V. Amrhein, S. Greenland, B. McShane, Scientists rise up against statistical significance (2019).

[14] R. L. Wasserstein, A. L. Schirm, N. A. Lazar, Moving to a world beyond p < 0.05” (2019).

[15] R. W. Schafer, et al., What is a savitzky-golay filter, IEEE Signal processing magazine 28 (4) (2011) 111–117.

[16] S. Nogueira, K. Sechidis, G. Brown, On the stability of feature selection algorithms, The Journal of Machine Learning Research 18 (174) (2018) 1–54.

[17] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis & Machine Intelligence (8) (2005) 1226– 1238.

[18] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al., Least angle regres- sion, The Annals of statistics 32 (2) (2004) 407–499.

[19] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, M. Sugiyama, High- dimensional feature selection by feature-wise kernelized lasso, Neural computation 26 (1) (2014) 185–207.

[20] X. Wu, K. Yu, W. Ding, H. Wang, X. Zhu, Online feature selection with streaming features, IEEE transactions on pattern analysis and machine intelligence 35 (5) (2012) 1178–1192.

[21] K. Yu, X. Wu, W. Ding, J. Pei, Scalable and accurate online feature selec- tion for big data, ACM Transactions on Knowledge Discovery from Data (TKDD) 11 (2) (2016) 1–39.

[22] J. Chen, M. Stern, M. J. Wainwright, M. I. Jordan, Kernel feature selection via conditional covariance minimization, in: Advances in Neural Information Processing Systems, 2017, pp. 6946–6955.

[23] R. C. T. de Souza, C. A. de Macedo, L. dos Santos Coelho, J. Pierezan, V. C. Mariani, Binary coyote optimization algorithm for feature selection, Pattern Recognition (2020) 107470.

[24] R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial intelligence 97 (1-2) (1997) 273–324.

[25] Y. Yang, J. O. Pedersen, A comparative study on feature selection in text categorization, in: ICML, Vol. 97, 1997, pp. 412–420.

[26] B. Xue, M. Zhang, W. N. Browne, X. Yao, A survey on evolutionary compu- tation approaches to feature selection, IEEE Transactions on Evolutionary Computation 20 (4) (2016) 606–626.

[27] Z. Wang, M. Li, J. Li, A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure, Information Sciences 307 (2015) 73–88.

[28] S. Paul, S. Das, Simultaneous feature selection and weighting–an evolution- ary multi-objective optimization approach, Pattern Recognition Letters 65 (2015) 51–59.

[29] Y. Sun, S. Todorovic, S. Goodison, Local-learning-based feature selection for high-dimensional data analysis, IEEE transactions on pattern analysis and machine intelligence 32 (9) (2009) 1610–1626.

[30] N. Armanfard, J. P. Reilly, M. Komeili, Local feature selection for data classification, IEEE transactions on pattern analysis and machine intelligence 38 (6) (2015) 1217–1227.

[31] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) (1996) 267–288.

[32] X. Chen, G. Yuan, F. Nie, J. Z. Huang, Semi-supervised feature selection via rescaled linear regression., in: IJCAI, 2017, pp. 1525–1531.

[33] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, A. Saha, H. Ouyang, D. Yin, H. Mamitsuka, C. Sahinalp, et al., Ultra high-dimensional nonlinear feature selection for big biological data, IEEE Transactions on Knowledge and Data Engineering 30 (7) (2018) 1352–1365.

[34] K. Yu, W. Ding, X. Wu, Lofs: a library of online streaming feature selection, Knowledge-Based Systems 113 (2016) 1–3.

[35] J. Zhou, D. P. Foster, R. A. Stine, L. H. Ungar, Streamwise feature selec- tion, Journal of Machine Learning Research 7 (Sep) (2006) 1861–1885.

[36] X. Wu, K. Yu, H. Wang, W. Ding, Online streaming feature selection, in: ICML, 2010, pp. 1159–1166.

[37] K. Yu, X. Wu, W. Ding, J. Pei, Towards scalable and accurate online feature selection for big data, in: 2014 IEEE International Conference on Data Mining, IEEE, 2014, pp. 660–669.

[38] J. Pierezan, L. D. S. Coelho, Coyote optimization algorithm: a new meta- heuristic for global optimization problems, in: 2018 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2018, pp. 1–8.

[39] L. Ali, C. Zhu, M. Zhou, Y. Liu, Early diagnosis of parkinson’s disease from multiple voice recordings by simultaneous sample and feature selection, Expert Systems with Applications 137 (2019) 22–28.

[40] A. S. Ashour, M. K. A. Nour, K. Polat, Y. Guo, W. Alsaggaf, A. El-Attar, A novel framework of two successive feature selection levels using weightbased procedure for voice-loss detection in parkinson’s disease, IEEE Access 8 (2020) 76193–76203.

[41] P. Sharma, S. Sundaram, M. Sharma, A. Sharma, D. Gupta, Diagnosis of parkinson’s disease using modified grey wolf optimization, Cognitive Systems Research 54 (2019) 100–115.

[42] M. Kim, J. H. Won, J. Hong, J. Kwon, H. Park, L. Shen, Deep network- based feature selection for imaging genetics: Application to identifying biomarkers for parkinson’s disease, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 1920–1923.

[43] O. Yaman, F. Ertam, T. Tuncer, Automated parkinson’s disease recogni- tion based on statistical pooling method using acoustic features, Medical Hypotheses 135 (2020) 109483.

[44] A. Ben-Israel, T. N. Greville, Generalized inverses: theory and applications, Vol. 15, Springer Science & Business Media, 2003.

[45] K. Sj¨ostrand, Matlab implementation of LASSO, LARS, the elastic net and SPCA, version 2.0 (June 2005). URL http://www2.imm.dtu.dk/pubdb/p.php?3897

[46] T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Toma- shevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, et al., NCBI GEO: archive for functional genomics data sets—update, Nucleic acids research 41 (D1) (2012) D991–D995.

[47] X. Zhou, K. Mao, LS bound based gene selection for DNA microarray data, Bioinformatics 21 (8) (2005) 1559–1564.

[48] M. Gutkin, R. Shamir, G. Dror, SlimPLS: a method for feature selection in gene expression-based disease classification, PloS one 4 (7).

[49] S. Liu, C. Xu, Y. Zhang, J. Liu, B. Yu, X. Liu, M. Dehmer, Feature selection of gene expression data for cancer classification using double rbf-kernels, BMC bioinformatics 19 (1) (2018) 1–14.

[50] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for largescale machine learning, in: 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), 2016, pp. 265–283.


Designed for Accessibility and to further Open Science