
Machine Learning Approach-Based Big Data Imputation Methods for Outdoor Air Quality Forecasting


Affiliations
1 Department of Mathematics, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam 612 001, Tamil Nadu, India
2 Department of Computer Science and Engineering, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam 612 001, Tamil Nadu, India
 

Abstract

Missing data is a common problem in ambient air quality databases, and it is considerably worse in small towns and cities. It is also a significant concern for environmental epidemiology: these settings have high pollution exposure levels worldwide, and gaps in the datasets obstruct health investigations that could later inform local and international policies. When a substantial number of observations contain missing values, the smaller effective sample size inflates the standard errors, which may significantly affect the final result. In general, the performance of missing-value imputation algorithms varies with the size of the database and the percentage of missing values in it. This paper proposes and demonstrates an ensemble-imputation-classification framework that reconstructs air quality information from a dataset from Beijing, China, in order to forecast air quality. Various single and multiple imputation procedures are used to fill in the missing records. An ensemble of diverse classifiers is then applied to the imputed data to determine the air pollution level. The proposed model aims to reduce the error rate and improve accuracy. Extensive testing on datasets with actual missing values shows that, compared with conventional single imputation techniques, the proposed methodology with multiple imputation and ensemble techniques significantly improves the accuracy of the air quality forecasting model.
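The general idea of combining imputation with an ensemble can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the imputation procedures or classifiers used, so everything here is an assumption made for illustration. The sketch fills gaps by random hot-deck draws (a simplified stand-in for a proper multiple imputation procedure), uses a toy nearest-centroid classifier in place of the paper's diverse classifiers, and runs on synthetic values rather than the Beijing dataset. All function and variable names are hypothetical.

```python
import random
import statistics
from collections import Counter

# Synthetic toy rows (hypothetical PM2.5 and PM10 readings); None marks a missing value.
train_X = [
    [12.0, 30.0], [15.0, None], [None, 28.0], [18.0, 35.0],
    [150.0, 210.0], [None, 190.0], [170.0, None], [160.0, 220.0],
]
train_y = ["low"] * 4 + ["high"] * 4


def observed_by_column(rows):
    """Collect the non-missing values of each column."""
    return [[v for v in col if v is not None] for col in zip(*rows)]


def mean_impute(rows):
    """Single imputation: fill each gap with the mean of its column."""
    means = [statistics.fmean(obs) for obs in observed_by_column(rows)]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def hot_deck_impute(rows, rng):
    """One stochastic completion: fill each gap with a randomly chosen
    observed value from the same column (random hot-deck)."""
    observed = observed_by_column(rows)
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(row)]
            for row in rows]


def fit_centroids(X, y):
    """Toy classifier (assumption): one mean vector per class label."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def mi_ensemble_predict(X, y, query, m=5, seed=0):
    """Build m completed copies of the data, train one model per copy,
    and combine the m predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(m):
        completed = hot_deck_impute(X, rng)
        votes.append(predict(fit_centroids(completed, y), query))
    return Counter(votes).most_common(1)[0][0]


# Single imputation yields one completed table; the ensemble votes over m of them.
single_completed = mean_impute(train_X)
label = mi_ensemble_predict(train_X, train_y, [14.0, 29.0])
```

Mean imputation produces a single completed table and so hides the uncertainty about the missing values, whereas the voting step lets that uncertainty show up as disagreement between the models trained on different completed copies. In practice the hot-deck draw and the centroid classifier would be replaced by whatever imputation procedures and classifiers a given study actually uses.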

Keywords

Air Quality, Big Data Analytics, Classification, Ensemble, Multiple Imputation.



Authors

Narasimhan D
Department of Mathematics, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam 612 001, Tamil Nadu, India
Vanitha M
Department of Computer Science and Engineering, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam 612 001, Tamil Nadu, India

