
A Comparison of Missing Data Handling Techniques


Affiliations
1 Department of Information Technology, Sona College of Technology, India
2 Department of Computer Applications, Sona College of Arts and Science, India
     



Abstract

Missing data is a recurring concern that data professionals have to deal with. Efficient analysis techniques are needed to find interesting patterns in such data. In this study, we compare 16 imputation methods: Linear, Index, Values, Nearest, Zero, Slinear, Quadratic, Cubic, Barycentric, Krogh, Polynomial, Spline, Piecewise Polynomial, From Derivatives, Pchip and Akima. These techniques are applied to a real-world UCI dataset under the Missing Completely at Random (MCAR) assumption. Our results suggest that the nearest, zero, quadratic and polynomial imputation methods provide above 96% accuracy, outperforming the other techniques.

Keywords

Missing Data, Imputation Methods, Missing Completely at Random.
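
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes a synthetic sine series in place of the UCI data, implements only three of the sixteen methods (linear, nearest, and a hold-previous-value reading of "zero"), and scores them by mean absolute error on the hidden entries, which may differ from the accuracy measure used in the study. The function names `mask_mcar`, `impute` and `mean_absolute_error` are hypothetical. The method names in the abstract appear to match the `method` options of pandas `Series.interpolate`, which would be the natural tool for the full set.

```python
import math
import random


def mask_mcar(values, missing_rate, rng):
    """Hide interior entries independently with probability missing_rate (MCAR)."""
    masked = list(values)
    # Endpoints stay observed so every gap has a known neighbour on both sides.
    for i in range(1, len(masked) - 1):
        if rng.random() < missing_rate:
            masked[i] = None
    return masked


def impute(masked, method):
    """Fill None gaps using 'linear', 'nearest' or 'zero' (hold previous value)."""
    if method not in ("linear", "nearest", "zero"):
        raise ValueError(f"unsupported method: {method}")
    filled = list(masked)
    known = [i for i, v in enumerate(masked) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)  # relative position inside the gap
            if method == "linear":
                filled[i] = masked[left] + t * (masked[right] - masked[left])
            elif method == "nearest":
                filled[i] = masked[left] if t <= 0.5 else masked[right]
            else:  # "zero": carry the last observed value forward
                filled[i] = masked[left]
    return filled


def mean_absolute_error(truth, masked, filled):
    """Average |truth - imputed| over the positions that were hidden."""
    hidden = [i for i, v in enumerate(masked) if v is None]
    if not hidden:
        return 0.0
    return sum(abs(truth[i] - filled[i]) for i in hidden) / len(hidden)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    truth = [math.sin(i / 5) for i in range(60)]  # synthetic stand-in series
    masked = mask_mcar(truth, missing_rate=0.2, rng=rng)
    for method in ("linear", "nearest", "zero"):
        error = mean_absolute_error(truth, masked, impute(masked, method))
        print(f"{method:8s} MAE = {error:.4f}")
```

Because each entry is hidden independently of its value, the masking step satisfies the MCAR assumption by construction. Which method scores best depends on the missing rate and on how smooth the underlying series is.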



Authors

S. David Samuel Azariya
Department of Information Technology, Sona College of Technology, India
V. Mohanraj
Department of Information Technology, Sona College of Technology, India
J. Jeba Emilyn
Department of Information Technology, Sona College of Technology, India
G. Jothi
Department of Computer Applications, Sona College of Arts and Science, India

