Enhancing the Classification Accuracy of Noisy Dataset by Fusing Correlation Based Feature Selection with K-Nearest Neighbour
The performance of data mining and machine learning tasks can be significantly degraded by noisy, irrelevant, and high-dimensional data containing a large number of features. A large proportion of real-world data contains noise or missing values, and many irrelevant features are collected alongside the useful ones during data acquisition. These redundant and irrelevant feature values distort the classification boundary, increase computational overhead, and reduce the predictive ability of the classifier. The high dimensionality of such datasets poses a major bottleneck in data mining, statistics, and machine learning. Among the available dimensionality-reduction methods, attribute (feature) selection is one of the most widely used. Because the k-Nearest Neighbour (k-NN) algorithm is sensitive to irrelevant attributes, its performance degrades significantly when a dataset contains missing values or noisy data; this weakness, however, can be mitigated by combining k-NN with a feature selection technique. In this research we combine Correlation-based Feature Selection (CFS) with the k-NN classification algorithm to obtain better classification results when the dataset contains missing values or noise. The reduced attribute set also decreases the time required for classification. The experiments show that when dimensionality reduction is performed with CFS and the data are then classified with k-NN, datasets with little or no noise may show a slight drop in classification accuracy compared with k-NN alone. When additional noise is introduced into these datasets, the performance of k-NN degrades significantly, whereas classifying the same noisy datasets with CFS and k-NN together improves classification accuracy.
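To make the described pipeline concrete, the sketch below illustrates the idea under stated assumptions; it is not the implementation evaluated in this research. It uses a simplified Pearson-correlation version of Hall's CFS merit, merit = k·r_cf / √(k + k(k−1)·r_ff), with a greedy forward search, and scikit-learn's KNeighborsClassifier for k-NN. The breast cancer dataset, the 10% attribute-noise level, and k = 5 are illustrative choices only, not the datasets or parameters reported here.

```python
# Minimal sketch: CFS-style feature selection followed by k-NN, compared against
# k-NN alone on a dataset with injected attribute noise. Illustrative assumptions:
# Pearson |r| as the correlation measure, greedy forward search, k = 5.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def cfs_merit(X, y, subset):
    """Correlation-based merit of a feature subset: k*r_cf / sqrt(k + k(k-1)*r_ff)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Greedy forward selection: add the feature that most improves the merit."""
    remaining, selected, best_merit = list(range(X.shape[1])), [], 0.0
    while remaining:
        merit, best_f = max((cfs_merit(X, y, selected + [f]), f) for f in remaining)
        if merit <= best_merit:
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_merit = merit
    return selected

# Inject noise: overwrite ~10% of attribute cells with values drawn from other cells.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
noise_mask = rng.random(X.shape) < 0.10
X_noisy = X.copy()
X_noisy[noise_mask] = rng.permutation(X.ravel())[:noise_mask.sum()]

X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, test_size=0.3, random_state=0)

# k-NN on the full (noisy) attribute set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("k-NN alone :", accuracy_score(y_te, knn.predict(X_te)))

# CFS-reduced attribute set, then k-NN.
feats = cfs_forward_search(X_tr, y_tr)
knn_cfs = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, feats], y_tr)
print("CFS + k-NN :", accuracy_score(y_te, knn_cfs.predict(X_te[:, feats])))
```

Comparing the two printed accuracies mirrors the comparison made above: on the noisy data, k-NN over the full attribute set tends to lose accuracy, while the CFS-reduced subset typically recovers part of it and also shortens classification time.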