
Data Warehouse and Big Data Integration


Affiliations
1 Distrital F.J.C University, Bogota, Colombia
 

Big Data has triggered an influx of research and new perspectives on concepts and processes that previously pertained to the Data Warehouse field. Some conclude that the Data Warehouse as such will disappear; others present Big Data as the natural evolution of the Data Warehouse (perhaps without identifying a clear division between the two); and others foresee a future of convergence, partially exploring the possible integration of both. In this paper, we review the underlying technological features of Big Data and Data Warehouse, highlighting their differences and areas of convergence. Although some differences exist, both technologies could (and should) be integrated because they serve the same purpose: data exploration and decision-making support. We explore some convergence strategies based on the elements common to both technologies. We review the state of the art in integration proposals from the point of view of purpose, methodology, architecture, and underlying technology, highlighting the common elements that support both technologies and may serve as a starting point for full integration, and we propose an approach for integrating the two technologies.

Keywords

Big Data, Data Warehouse, Integration, Hadoop, NoSQL, MapReduce, 7V’s, 3C’s, M&G.


Authors

Sonia Ordonez Salinas
Distrital F.J.C University, Bogota, Colombia
Alba Consuelo Nieto Lemus
Distrital F.J.C University, Bogota, Colombia

