
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Implemented on Different Hadoop Platforms


Affiliations
1 Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
 

Abstract

This paper reports the outcome of implementing the same transitive closure (TC) algorithm for Apache MapReduce on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for storing and processing large amounts of data in a distributed computing environment. The research focuses on the variations observed in the results of an efficient iterative transitive closure algorithm when it is run in different distributed environments. The results from each environment were validated against benchmark results from OYSTER, an open source entity resolution system. The experimental results highlight the inconsistencies that can occur when the same codebase is run on different implementations of MapReduce.
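To make the idea of an iterative transitive closure concrete, the following is a minimal sketch in Python that simulates map and reduce rounds in memory. It is not the algorithm evaluated in the paper and does not use any Hadoop API. The function names (`map_phase`, `reduce_phase`, `transitive_closure`) and the record identifiers are hypothetical, and the minimum-label propagation strategy is an assumption chosen for brevity.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (record, candidate_label) pairs.

    Each record keeps its current label, and each matched pair
    exchanges labels so the smaller one can spread.
    """
    for record, label in labels.items():
        yield record, label
    for a, b in edges:
        yield a, labels[b]
        yield b, labels[a]


def reduce_phase(pairs):
    """Group candidate labels by record and keep the smallest one."""
    grouped = defaultdict(list)
    for record, label in pairs:
        grouped[record].append(label)
    return {record: min(candidates) for record, candidates in grouped.items()}


def transitive_closure(edges):
    """Repeat map/reduce rounds until no label changes.

    `edges` is a list of (record_id, record_id) match pairs. The result
    maps every record to the smallest record id in its cluster.
    """
    labels = {record: record for edge in edges for record in edge}
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        if new_labels == labels:
            return labels
        labels = new_labels


if __name__ == "__main__":
    # Hypothetical match pairs: r1-r2 and r2-r3 chain into one cluster.
    matches = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
    print(transitive_closure(matches))
```

In this sketch, r1, r2, and r3 end up sharing the label r1 even though r1 and r3 were never matched directly, while r4 and r5 form a second cluster. Because the reducer takes a minimum, the result of this particular sketch does not depend on the order in which values reach it.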

Keywords

Entity Resolution, Hadoop, MapReduce, Transitive Closure, HDFS, Cloudera, Talend.

Authors

Purvi Parmar
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
MaryEtta Morris
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
John R. Talburt
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
Huzaifa F. Syed
Center for Advanced Research in Entity Resolution and Information Quality, University of Arkansas at Little Rock, Little Rock, Arkansas, United States

