Abstract

Feature selection is a key issue in the domain of machine learning and related fields. The results of feature selection can directly affect the classifier’s classification accuracy and generalization performance. Recently, a statistical feature selection method named effective range based gene selection (ERGS) is proposed. However, ERGS only considers the overlapping area (OA) among effective ranges of each class for every feature; it fails to handle the problem of the inclusion relation of effective ranges. In order to overcome this limitation, a novel efficient statistical feature selection approach called improved feature selection based on effective range (IFSER) is proposed in this paper. In IFSER, an including area (IA) is introduced to characterize the inclusion relation of effective ranges. Moreover, the samples’ proportion for each feature of every class in both OA and IA is also taken into consideration. Therefore, IFSER outperforms the original ERGS and some other state-of-the-art algorithms. Experiments on several well-known databases are performed to demonstrate the effectiveness of the proposed method.

1. Introduction

Feature selection is widely used in the domain of pattern recognition, image processing, data mining, and machine learning before the tasks of clustering, classification, recognition, and mining [1]. In real-world applications, the huge dataset usually has a large number of features which contains much irrelevant or redundant information [1]. Redundant and irrelevant features cannot improve the learning accuracy and even deteriorate the performance of the learning models. Therefore, selecting an appropriate and small feature subset from the original features not only helps to overcome the “curse of dimensionality” but also contributes to accomplish the learning tasks effectively [2]. The aim of feature selection is to find a feature subset that has the most discriminative information from the original feature set. In general, feature selection methods are usually divided into three categories: embedded, wrapper, and filter methods [3, 4]. They are categorized based on whether or not they are combined with a specific learning algorithm.

In the embedded methods, the feature selection algorithm is always regarded as a component in the learning model. The most typical embedded based feature selection algorithms are decision tree approaches, such as ID3 [5], C4.5 [6], and CART algorithm [7]. In these algorithms, the features with the strongest ability of classification are selected in the nodes of the tree, and then the selected features are utilized to conduct a subspace to perform the learning tasks. Obviously the process of decision tree generation is also feature selection process.

Wrapper methods directly use the selected features to train a specific classifier and evaluate the selected subset according to the performance of the classifier. Therefore, the performances of wrapper methods strongly depend on the given classifier. Sequential forward selection (SFS) and sequential backward selection (SBS) [8] are two well-studied wrapper methods. SFS was initialized to an empty set. Then, the best feature from the complete feature set was chosen according to the evaluation criteria in each step and added into the candidate feature subset until it meets the stop condition. On the contrary, SBS started from the complete feature set. Then, it eliminated a feature which has the minimal impact on the classifier in each step until it satisfied the stop condition. Recently, Kabir et al. proposed a new wrapper based feature selection approach using neural network [9]. The algorithm was called constructive approach for feature selection (CAFS). The algorithm used a constructive approach involving correlation information to select the features and determine the architectures of neural network. Another wrapper based feature selection method was also proposed by Ye and Gong. In their approach, they considered the feature subset as the evaluation unit and the subset’s convergence ability was utilized as the evaluation standard [10] for feature selection.

Different from the embedded and wrapper based algorithms, filter based feature selection methods directly select the best feature subset based on the intrinsic properties of the data. Therefore, the process of feature selection and learning model is independent in them. At present, the algorithms of filter based feature selection can be divided into two classes [11]: ranking and space searching. For the former, the feature selection process can be regarded as a ranking problem. More specifically, the weight (or score) of each feature is firstly computed. Then, the top features are selected according to the ascending order of weight (or score). Pearson Correlation Coefficient (PCC) [12], Mutual Information (MI) [13], and Information Gain (IG) [14] are three commonly used ranking criterion to measure the dependency between each feature and the target variable. Another ranking criterion method named Relief [15], which analyzed the importance of each feature by computing the relationship between an instance and its nearest neighbors from the same and different classes, was proposed by Kira and Rendell. Then, an extension of Relief termed Relief-F was developed in [16]. Besides, there also exist many other methods proposed for ranking based filter feature selection. For more details about these algorithms, the readers can refer to [3, 4]. Although the ranking based filter methods have been applied to some real-world tasks successfully, a common shortcoming of these methods is that the feature subset selected by them may contain redundancy. In order to solve this problem, some space searching based filter methods have been proposed to remove the redundancy during feature selection. Correlation-based feature selection (CFS) [17] is a typical space searching algorithm; it did not only consider the correlation among features but also take the correlation between features and classes into account. Thus, CFS inclined to select the subset contains features that are highly correlated with the class and uncorrelated with each other. Minimum redundancy maximum relevance (MRMR) [18] is another method presented to reduce the redundancy of the selected feature subset.

Since both embedded and wrapper based feature selection methods interact with the classifier, they can only select the optimal subset for a particular classifier. So the features selected by them may be worse for other classifiers. Moreover, another disadvantage of the two methods is that they are more time consuming than filter method. Therefore, filter method is more fit for dealing with data that has large amounts of features since it has a good generalization ability [19]. As a result, we mainly focus on the research for filter based feature selection in this work.

In this paper, an integrated algorithm named Improved feature selection based on effective range (IFSER) is proposed for filter based feature selection. Our IFSER can be considered as an extension of the study in [20]. In [20], Chandra and Gupta presented a new statistical feature selection method named effective range based gene selection (ERGS). ERGS utilized the effective range of statistical inference theory [21] to calculate the weight of each feature, and a higher weight was assigned to the most important feature to distinguish different classes. However, since ERGS only considered the overlapping area (OA) among effective range of each class for every feature, it fails to handle the other relationships among the features of different classes. In order to overcome this limitation, the concept of including area (IA) is introduced into the proposed IFSER to characterize the inclusion relationship of effective ranges. Moreover, the samples’ proportion for each feature of every class in both OA and IA is also taken into consideration in our IFSER. Therefore, IFSER outperforms the original ERGS and some other state-of-the-art algorithms. Experiments on several well-known databases are performed to demonstrate the effectiveness of the proposed method.

The rest of this paper is organized as follows. Section 2 briefly reviews ERGS and effective range. The proposed IFSER is introduced in Section 3. Section 4 reports experimental results on four datasets. Finally, we provide some conclusions in Section 5.

2. A Briefly Review on ERGS

In this section, we will review the effective range and ERGS algorithm briefly [20].

Let be the feature set of the dataset ,  .    is the class labels of . The class probability of th class is . For each class of the th feature , and denote the mean and standard deviation of the th feature for class , respectively. Effective range () of th class for th feature is defined by where and are the lower and upper bounds of the effective range, respectively. The prior probability of th class is . Here, the factor () is taken to scale down effect of class with high probabilities and consequently large variance. The value of is determined statistically by Chebyshev inequality defined as which is true for all distributions. The value of is set as 1.732 for the effective range which contains at least 2/3rd of the data objects [20].

Overlapping area () among classes of feature is computed by where can be defined as

In ERGS, for a given feature, the effective range of every class is first calculated. Then, the overlapping area of the effective ranges is calculated according to (3), and the area coefficient is computed for each feature. Next, the normalized area coefficient is regarded as the weight for every feature and an appropriate number of features are selected on the basis of feature weight. For more detailed information about the ERGS algorithm, the readers can refer to [20].

3. Improved Feature Selection Based on Effective Range

In this section, we present our improved feature selection based on effective range (IFSER) algorithm, which integrates overlapping area, including area and the samples’ proportion for each feature of every class, into a unified feature selection framework.

3.1. Motivation

Although ERGS considers the overlapping area of every class for each feature, it fails to handle the problem of the inclusion relation of effective ranges. The problem is very realistic in real-world applications. Taking the gene data set as an example, Figure 1 shows the effective ranges of two gene samples from the Leukemia2 [22] gene database. From this figure, we can see that the overlapping area of gene number 9241 in Figure 1(a) is 165.7, and the overlapping area of gene number 3689 in Figure 1(b) is 170.8. Since the two overlapping areas of these two genes are similar, their weights obtained by ERGS are also similar. However, the relationships between the effective ranges in these two genes are very different. In Figure 1(a), the effective range of class 1 is completely included in the effective range of class 2, while the effective range of class 1 is partly overlapping with the effective range of class 2 in Figure 1(b). Therefore, the weight of the gene number 9241 in Figure 1(a) should be less than that in Figure 1(b) since all the samples in class 1 cannot be corrected and classified in this case. For this reason, the inclusion relation between the effective ranges (including area) must be taken into consideration.

Another example is shown in Figure 2. As can be seen from this figure, it is clearly found that the two features in Figures 2(a) and 2(b) have the same size of the overlapping area. However, the number of samples in these two areas is very different. In Figure 2(a), the number of samples belonging to the overlapping area is small but the number of samples belonging to the overlapping area in Figure 2(b) is relatively large. Thus, it is obvious that feature 1 is more important than feature 2 since more samples can be correctly classified. In other words, the weight assigned to feature1 should be greater than that assigned to feature 2. From this example, we can see that the samples’ proportion for each feature of every class in both overlapping and including areas is also a vital factor to influence the features’ weights and should be considered in the feature selection process.

3.2. Improved Feature Selection Based on Effective Range

Similar to ERGS, we suppose is the feature set of the dataset , .    is the class label set of the data samples in . The class probability of th class is . For each class of th feature , and denote the mean and standard deviation of the th feature in class , respectively.

The first step of our proposed IFSER is to calculate the effective range of every class by where the definitions of , , , and are the same as those in ERGS.

The second step of our IFSER is to calculate overlapping areas among classes of feature    by where the definition of is as same as in ERGS.

The third step of our proposed IFSER is to compute including area among classes of feature    by where can be defined as

The fourth step of our proposed IFSER is to compute area coefficient () of feature    as where . Then, the normalized area coefficient () can be obtained by

From (10), we can clearly see that the features with larger NAC values are more important for distinguishing different classes.

The fifth step of our proposed IFSER is to calculate the samples’ number of each class in and for each feature . Let and denote samples’ numbers of the th class in and for each feature . Assume that   represents the number of samples in theth class. Then we use divided by and to represent the proportions of samples in and , and for all classes of each feature the sums of the and are written as and .

For all classes of each feature , the normalized and can be obtained by From (11), the larger the value of and , the more significant the feature is.

The last step of our proposed IFSER is to compute the weight of each feature as where and . From (12), we can see that a larger value of indicates that the th feature is more important. Therefore, we can select the features according to their weights and choose features with larger weights to form the selected feature subset.

Finally, the proposed IFSER algorithm can be summarized as in Algorithm 1.

Input: Data matrix , , the number of
selected feature k.
Output: Feature subset.
(1) Compute the ER of each feature by (5);
(2) Compute the and by (6) and (7);
(3) Compute the by (9);
(4) Normalize the by (10);
(5) Calculate the and by (11);
(6) Compute the weight of each feature by (12);
(7) Sort the weight of all features in a descending order;
(8) Select the best k features;

4. Experiment and Results

In this section, in order to verify the performance of our proposed method, we conducted experiments on four datasets (Lymphoma [23], Leukemia1 [24], Leukemia2 [22], and 9_Tumors [25]) and compare our algorithm with five popular feature selection algorithms including ERGS [20], PCC [12], Relief-F [16], MRMR [18], and Information Gain [14]. Three classifiers are used to verify the effectiveness of our proposed method. The classification accuracies are obtained through leave-one-out cross-validation (LOOCV) in this work.

4.1. The Description of Datasets
4.1.1. Lymphoma Database

The Lymphoma database [23] consists of 96 samples and 4026 genes. There are two classes of samples in the dataset. The dataset comes from a study on diffuse large B-cell lymphoma.

4.1.2. Leukemia1 Database

Leukemia1 database [24] contains three types of Leukemia samples. The database has been constructed from 72 people who have acute myelogenous leukemia (AML), acute lymphoblastic leukemia (ALL) B cell, or ALL T-cell, and each sample is composed of 5327 gene expression profiles.

4.1.3. Leukemia2 Database

The Leukemia2 dataset [22] contains a total of 72 samples in three classes: AML, ALL, and mixed-lineage leukemia (MLL). The number of genes is 11225.

4.1.4. 9_Tumors Database

9_Tumors database [25] consists of 60 samples of 5726 genes and categorized into 9 various human tumor types.

4.2. Experimental Results Using C4.5 Classifier

In this subsection, we estimate the performance of our proposed IFSER using C4.5 classifier on the four gene databases. Tables 1, 2, 3, and 4 summarize the results of the classification accuracies achieved by our methods and other methods. As we can see from Tables 14, the proposed IFSER method performs better than the other five algorithms in most cases. In particular, our proposed IFSER is much better than ERGS. The reason is that our proposed IFSER not only considers the overlapping area (OA) but also takes the including area and samples’ proportion into account. These results demonstrate the fact that IFSER is able to select the best informative genes compared to other well-known techniques.

For Lymphoma database, the classification accuracy of our proposed IFSER is substantial improvement compared with other algorithms. What is more, it is worth mentioning that our method only uses 10 features to achieve 93.75% classification accuracy. With the increase in feature dimension, the classification results of most methods (such as our proposed IFSER, PCC, IG, and ERGS) are reduced. For Relief-F and MRMR, the classification results are very low when the feature dimension is equal to 10 at the beginning. Then, with the increase in feature dimension, the classification results are improved. When they achieve the best results, the classification results begin to decrease with the increase in the dimension again.

For Leukemia1 and Leukemia2 databases, the performance of our proposed IFSER is also better than ERGS and other methods. Our proposed IFSER can achieve the best results when the feature dimension is between 50 and 70. For Leukemia1 database, the performances of MRMR and ERGS keep stable on most dimensions. The trend of the classification results of PCC on Leukemia2 is inconsistent with those on Lymphoma database since it is almost monotonously decreased with the increase of feature dimension. And the other results are consistent with the experiments on Lymphoma database.

For 9_Tumors database, as we can see from Table 4, the performances of all the methods are very low due to the fact that database only contains 60 samples but 5726 genes. However, the performance of our proposed IFSER is much better than other algorithms. This result demonstrates the fact that our proposed IFSER is able to deal with the small sample size and high dimensions gene data.

4.3. Experimental Results Using NN Classifier

In this subsection, we evaluate the performance of our proposed IFSER using nearest neighbor (NN) classifier on the four gene databases. The results of the classification accuracies achieved by our proposed and other methods are listed in Tables 5, 6, 7, and 8. Comparing Tables 58 with Tables 14, we can see that the classification results of all the methods are improved. For Lymphoma database, IFSER, PCC, and ERGS are better than Relief-F, IG, and MRMR. For Leukemia1 database, our proposed IFSER and PCC outperform Relief-F, IG, MRMR, and ERGS. And the best result of IFSER is the same as PCC. However, for Leukemia2, IFSER, IG, and Relief-F achieve the best results than PCC, MRMR, and ERGS. For 9_Tumors database, the performance of IFSER is worse than PCC, IG, and MRMR, but better than Relief-F and ERGS. These results demonstrate the fact that result of feature selection depends on the classifier, and it is crucial to choose an appropriate classifier for different feature selection methods.

4.4. Experimental Results Using SVM Classifier

The performance of our proposed IFSER using support vector machine (SVM) classifier on the four gene database is tested in this subsection. Figures 36 show the classification accuracies of different algorithms on four gene databases. From Figures 3 and 4, we can see that our proposed IFSER outperforms other algorithms in most cases. And the IFSER achieves its best result at a lower dimension than other algorithms. This result further demonstrates the fact that IFSER is able to select the best informative genes as compared to other feature selection techniques. As we can see from Figure 5, our proposed IFSER is worse than Relief-F, IG, MRMR and ERGS. From Figure 6, it is found that our proposed IFSRE outperforms PPC, Relief-F, IG, and ERGS but is not as good as MRMR. This indicates that the SVM classifier is not suitable for the feature selected results of our proposed algorithm on small sample size databases.

5. Conclusions

In this paper, we propose a novel statistical feature selection algorithm named effective range based gene selection (IFSER). Compared with existing algorithms, IFSER not only considers the overlapping areas of the features in different classes but also takes the including areas and the samples’ proportion in overlapping and including areas into account. Therefore, IFSER outperforms the original ERGS and some other state-of-the-art algorithms. Experiments on several well-known databases are performed to demonstrate the effectiveness of the proposed method.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by Fund of Jilin Provincial Science & Technology Department (nos. 201115003, 20130206042GX), Fundamental Research Funds for the Central Universities (no. 11QNJJ005), the Science Foundation for Postdoctor of Jilin Province (no. 2011274), and Young Scientific Research Fund of Jilin Province Science and Technology Development project (no. 20130522115JH).