
High Dimensional Unbalanced Data Classification Vs SVM Feature Selection


Authors

S. Chinna Gopi
Vijaya Institute of Technology for Women (VITW), Enikepadu, Vijayawada - 521108, Andhra Pradesh, India
B. Suvarna
VFSTR University, Vadlamudi, Guntur - 522213, Andhra Pradesh, India
T. Maruthi Padmaja
VFSTR University, Vadlamudi, Guntur - 522213, Andhra Pradesh, India

Abstract


Background/Objectives: It is well known that the performance of classification models is prone to the class imbalance problem, which occurs when one class of data severely outnumbers the other classes. Classification models learned with Support Vector Machines (SVM) are prominent for exhibiting good generalization ability even in the context of class imbalance. However, a high imbalance ratio has been shown to hinder SVM learning performance. With this concern, this paper presents an empirical study on the viability of SVM for feature selection from moderately and highly unbalanced datasets. Methods/Statistical Analysis: Support Vector Machine-Recursive Feature Elimination (SVM-RFE) wrapper feature selection is analyzed in this study, and its performance on one document-analysis and two biomedical unbalanced datasets is compared with two prominent feature selection methods, the Chi-Square (CHI) test and Information Gain (IG), using Decision Tree and Naive Bayes classification models. Findings: Two major findings are reported from this empirical study: 1. For the considered scenarios, classification models learned on IG- and CHI-selected features performed better than those learned on SVM-RFE-selected features under a high class imbalance setting. 2. SVM-RFE on rebalanced data yielded better performance than SVM-RFE on the original data. Application/Improvements: All of the considered feature selection methods, including SVM-RFE, yielded better performance on oversampled data than SVM-RFE on the original data. Overall, this study reports that models learned with the Decision Tree classifier exhibited better performance than those learned with the Naive Bayes classifier.
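
The comparison pipeline described in the abstract can be sketched as follows. This is only an illustrative reconstruction using scikit-learn and imbalanced-learn on synthetic data, not the authors' implementation: the datasets, the oversampling method, the SVM parameters, the number of retained features, and the evaluation metric are all assumptions.

```python
# Illustrative sketch: SVM-RFE vs. Chi-Square (CHI) and Information Gain (IG)
# feature selection on an imbalanced dataset, evaluated with Decision Tree
# and Naive Bayes classifiers, with optional oversampling before selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler  # assumed rebalancing step

# Synthetic high-dimensional, highly imbalanced data (stand-in for the
# document-analysis and biomedical datasets used in the study).
X, y = make_classification(n_samples=2000, n_features=200, n_informative=20,
                           weights=[0.95, 0.05], random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Optional rebalancing of the training split before feature selection
# (the paper's exact oversampling technique may differ).
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

k = 30  # number of features to retain (assumed)
selectors = {
    "SVM-RFE": RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=k),
    "CHI":     SelectKBest(chi2, k=k),
    "IG":      SelectKBest(mutual_info_classif, k=k),
}
classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "NaiveBayes":   GaussianNB(),
}

for train_name, (Xt, yt) in {"original": (X_tr, y_tr),
                             "oversampled": (X_bal, y_bal)}.items():
    for s_name, sel in selectors.items():
        sel.fit(Xt, yt)                       # rank/score features
        Xt_s, Xte_s = sel.transform(Xt), sel.transform(X_te)
        for c_name, clf in classifiers.items():
            clf.fit(Xt_s, yt)
            f1 = f1_score(y_te, clf.predict(Xte_s))  # minority-class F1
            print(f"{train_name:11s} | {s_name:7s} + {c_name:12s} F1 = {f1:.3f}")
```

The sketch evaluates every selector/classifier pair on both the original and the oversampled training data, which mirrors the comparison reported in the findings; the minority-class F1 score is used here only as a convenient imbalance-aware metric.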

Keywords


Class Imbalance Problem, Chi-Square, Information Gain, Support Vector Machine, SVM-RFE.



DOI: https://doi.org/10.17485/ijst/2016/v9i30/129685