
Feature Influence Based ETL for Efficient Big Data Management


Affiliations
1 SRM Institute of Science and Technology, Kattankulathur, Chengalpat 603 203, Tamil Nadu, India
 

The growing volume of big data makes its maintenance and analysis increasingly difficult. Various approaches to the problem exist, but they fall short of the expected results. To improve big data management performance, this article presents an efficient, real-time Extraction, Transformation, and Loading (ETL) framework based on feature influence analysis. The model fetches the big data and preprocesses the data set, analysing the features to identify noisy records. The method then performs feature extraction and applies feature influence analysis to the data nodes and to the data they hold. It estimates two measures, Feature Specific Informative Influence (FSII) and Feature Specific Supportive Influence (FSSI), with the support of a data dictionary and a class ontology covering the various classes of data. FSII is measured according to the presence of concrete features in a tuple with respect to a data node, whereas FSSI is measured according to the appearance of supportive features in a data point with respect to that data node. From these measures, the method computes the Node Centric Transformation Score (NCTS), and based on the NCTS value it performs map reduction and merging of data nodes. The NCTS_FIA method achieves higher performance in the ETL process: adopting feature influence analysis in big data management improves ETL performance while lowering time complexity.

Keywords

Cloud, FSII, FSSI, FIA, NCTS, Ontology.
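
The abstract names the FSII, FSSI, and NCTS measures but does not give their formulas. The following is only a minimal sketch of how such scores might be combined, under loudly stated assumptions: FSII is taken as the share of a node's concrete features present in a record, FSSI as the share of its supportive features present, and NCTS as their product. The `DataNode` class, the `present` and `load` helpers, and the noise rule are hypothetical illustrations, not the authors' method, and the map reduction and node-merging steps are not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical data node: one class of data plus the tuples stored on it."""
    name: str
    concrete: set[str]    # assumed to come from the data dictionary
    supportive: set[str]  # assumed to come from the class ontology
    tuples: list[dict] = field(default_factory=list)


def present(record: dict) -> set[str]:
    """Features that carry a non-empty value in the record."""
    return {k for k, v in record.items() if v not in (None, "")}


def fsii(record: dict, node: DataNode) -> float:
    """Assumed FSII: share of the node's concrete features found in the record."""
    if not node.concrete:
        return 0.0
    return len(present(record) & node.concrete) / len(node.concrete)


def fssi(record: dict, node: DataNode) -> float:
    """Assumed FSSI: share of the node's supportive features found in the record."""
    if not node.supportive:
        return 0.0
    return len(present(record) & node.supportive) / len(node.supportive)


def ncts(record: dict, node: DataNode) -> float:
    """Assumed NCTS: product of the two influence measures."""
    return fsii(record, node) * fssi(record, node)


def load(records: list[dict], nodes: list[DataNode], min_features: int = 1) -> None:
    """Skip noisy records, then place each remaining record on its best-scoring node."""
    for record in records:
        if len(present(record)) < min_features:
            continue  # assumption: a record with too few filled features is noise
        best = max(nodes, key=lambda n: ncts(record, n))
        if ncts(record, best) > 0:
            best.tuples.append(record)
```

With these assumed definitions, a record that contains both concrete features of a node and one of its two supportive features scores FSII = 1.0, FSSI = 0.5, and NCTS = 0.5 for that node. A product was chosen so that a record must show both kinds of evidence to be placed; a weighted sum would be an equally plausible reading of the abstract.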



Authors

M Vijayalakshmi
SRM Institute of Science and Technology, Kattankulathur, Chengalpat 603 203, Tamil Nadu, India
R I Minu
SRM Institute of Science and Technology, Kattankulathur, Chengalpat 603 203, Tamil Nadu, India


References