
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Implemented on Different Hadoop Platforms


Affiliations
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
 

Abstract

This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm for Apache MapReduce on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for storing and processing large amounts of data in a distributed computing environment. The research presented here focuses on the variations observed in the results of an efficient iterative transitive closure algorithm when it is run in different distributed environments. The results from each environment were validated against benchmark results from OYSTER, an open source entity resolution system. The experimental results highlight the inconsistencies that can occur when the same codebase is used with different implementations of MapReduce.

Keywords

Entity Resolution, Hadoop, MapReduce, Transitive Closure, HDFS, Cloudera, Talend.
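
To make the idea of an iterative transitive closure over matched record pairs concrete, the following is a minimal single-process sketch in Python. It is not the algorithm evaluated in the paper and does not use Hadoop or MapReduce; it only imitates map and reduce rounds with a generic smallest-label propagation scheme. The function name `transitive_closure` and the record IDs are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def transitive_closure(pairs):
    """Group record IDs into clusters by iterating smallest-label propagation.

    Illustrative sketch only: a generic connected-components scheme,
    not the paper's MapReduce implementation.
    """
    # Every record starts in its own cluster, labelled by its own ID.
    label = {}
    for a, b in pairs:
        label.setdefault(a, a)
        label.setdefault(b, b)

    changed = True
    while changed:
        changed = False
        # "Map" round: each pair proposes the smaller of its two labels
        # to both of its endpoints.
        proposals = defaultdict(list)
        for a, b in pairs:
            smallest = min(label[a], label[b])
            proposals[a].append(smallest)
            proposals[b].append(smallest)
        # "Reduce" round: each record keeps the smallest label it received.
        for record, candidates in proposals.items():
            best = min(candidates)
            if best < label[record]:
                label[record] = best
                changed = True

    # Collect records that share a final label into one cluster.
    clusters = defaultdict(set)
    for record, root in label.items():
        clusters[root].add(record)
    return {frozenset(members) for members in clusters.values()}


# Hypothetical match pairs; the same pairs in a different order
# should yield the same set of clusters.
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
run_a = transitive_closure(pairs)
run_b = transitive_closure(list(reversed(pairs)))
print(run_a == run_b)
```

Returning the clusters as a set of frozensets makes two runs directly comparable regardless of record or cluster order, which is the kind of check needed when validating one platform's output against a benchmark.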



Authors

Purvi Parmar
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
MaryEtta Morris
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
John R. Talburt
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
Huzaifa F. Syed
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States

