
An Enhanced Low Frequency Discretizer (ELFD) in Data Cleansing Stage


Affiliations
1 Charles Sturt University, Sydney Campus, Australia
2 Ashford University, San Diego, California, United States
 

Objective: Organizations now routinely use data mining techniques to support knowledge discovery, and discretization algorithms are the main techniques applied in the data cleansing stage. This study develops an enhanced discretization algorithm and uses it to investigate the impact of data cleansing on knowledge discovery. Methodology: The ELFD algorithm is based on the Low Frequency Discretizer (LFD), which comprises four phases: copying the dataset, calculating the correlation ratio, identifying cut points and discretizing the dataset. The ELFD uses only a subset of the categorical attributes in order to increase the correlation ratio between a numerical attribute and each categorical attribute. We evaluate the new algorithm against the LFD on health datasets, with the classification accuracy of the discretized dataset as the main evaluation criterion. Findings: The classification accuracy obtained with the ELFD is higher than that obtained with the LFD, an improvement of approximately 9%. Allowing for manual recording errors, the processing time of the ELFD is similar to that of the LFD. Conclusion: The ELFD adds a step that selects the top 75% of categorical attributes, ranked by correlation ratio value, and then calculates the correlation ratio between the numerical attribute and these selected attributes. Using only this subset of the categorical attributes increases the correlation ratio values, so the ELFD improves knowledge discovery from personal information contained in health records during the data cleansing stage.
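The attribute selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the standard correlation ratio (eta) between a categorical and a numerical attribute, since the abstract does not give the exact formula, and it assumes that "top 75%" is rounded up to a whole number of attributes. The function names `correlation_ratio` and `top_categorical_attributes`, the `fraction` parameter and the record layout (a list of dictionaries) are all hypothetical.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Standard correlation ratio (eta) between a categorical attribute
    and a numerical attribute. Assumed here; the abstract does not state
    the exact formula used by the ELFD."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)
    # Between-group sum of squares, weighted by group size.
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total sum of squares of the numerical attribute.
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0


def top_categorical_attributes(records, numerical, categorical, fraction=0.75):
    """Rank categorical attributes by their correlation ratio with one
    numerical attribute and keep the top `fraction` of them.
    Rounding up is an assumption of this sketch."""
    values = [r[numerical] for r in records]
    scored = [
        (correlation_ratio([r[a] for r in records], values), a)
        for a in categorical
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = math.ceil(fraction * len(scored))
    return [attribute for _, attribute in scored[:keep]]


# Toy records with one numerical and four categorical attributes (made up).
records = [
    {"num": 1, "a": "x", "b": "x", "c": "k", "d": "p"},
    {"num": 2, "a": "x", "b": "y", "c": "k", "d": "q"},
    {"num": 9, "a": "y", "b": "x", "c": "k", "d": "r"},
    {"num": 10, "a": "y", "b": "y", "c": "k", "d": "s"},
]
selected = top_categorical_attributes(records, "num", ["a", "b", "c", "d"])
```

In this toy data, attribute `c` takes a single value and therefore has a correlation ratio of zero, so it is the one dropped when three of the four attributes are kept. The later phases (identifying cut points and discretizing) are not sketched because the abstract gives no detail on them.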



Authors

Che Li
Charles Sturt University, Sydney Campus, Australia
Abeer Alsadoon
Charles Sturt University, Sydney Campus, Australia
P. W. C. Prasad
Charles Sturt University, Sydney Campus, Australia
A. Elchouemi
Ashford University, San Diego, California, United States







DOI: https://doi.org/10.17485/ijst%2F2018%2Fv11i40%2F120367