
Hash Semi Join MapReduce to Join Billion Records in a Reasonable Time




Authors

Marwa Hussien Mohamed
Department of Information Systems, Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt
Mohamed Helmy Khafagy
Department of Computer Science, Fayoum University, Cairo, Egypt
Mohamed Hasan Ibrahim
Department of Information Systems, Fayoum University, Cairo, Egypt

Abstract


Objective: MapReduce is a programming model used to process massive data sets, and analyzing such big data is one of today's most important challenges. Methods/Statistical Analysis: MapReduce is used to discover hidden patterns and relations in data and to extract more useful information through two simple programmer-written functions, map and reduce, while providing load balancing, fault tolerance and high scalability. The join is the most important operation in data analysis, but MapReduce does not support joins directly. Findings: This paper explains two two-way MapReduce join algorithms, semi-join and per-split semi-join, and proposes a new algorithm, hash semi-join, which uses a hash table to increase performance by eliminating unused records as early as possible and by applying the join through a hash-table lookup rather than using a map function to match the join key against the other table in the second phase. Using a hash table does not noticeably affect memory consumption because only the matched records from the second table are kept. Our experimental results show that the hash semi-join algorithm outperforms the other two algorithms as the data size grows from 100 million to 50 billion records. Application/Improvements: Running time increases with the number of joined records between the two tables when the workload is run on 30 machines, but our algorithm achieves a better running time than the other algorithms.
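
The early-filtering idea described above can be illustrated with a minimal sketch, assuming a Hadoop MapReduce job written in Java. The sketch supposes that the distinct join keys of the first table were extracted in an earlier phase and shipped to every node as a file (here called keys.txt, a hypothetical name); the mapper then probes an in-memory hash set so that unmatched records of the second table are discarded as early as possible and only matching records flow on to the join. This is only an illustrative approximation of the hash semi-join idea, not the authors' exact implementation.

// Minimal sketch of the map-side filtering step of a hash semi-join,
// assuming the Hadoop mapreduce API and a file "keys.txt" (hypothetical name)
// holding the distinct join keys of the first table, one per line,
// distributed to every task node.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashSemiJoinFilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory hash table of join keys from the first table.
    private final Set<String> joinKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the join keys once per task before any records are processed.
        try (BufferedReader reader = new BufferedReader(new FileReader("keys.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joinKeys.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assume comma-separated records whose first field is the join key.
        String[] fields = record.toString().split(",", 2);
        // Probe the hash table: only records of the second table whose key
        // matches are emitted, so unmatched records never reach the join phase
        // and only matched records need to be kept in memory or shuffled.
        if (fields.length == 2 && joinKeys.contains(fields[0])) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}

Because the probe is a constant-time hash lookup, the filtering cost stays proportional to the size of the second table rather than to the product of the two tables, which is what allows unused records to be dropped before the more expensive join phase.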

Keywords


Hadoop, Hash Semi Join, MapReduce, Two-Way Join



DOI: https://doi.org/10.17485/ijst/2018/v11i18/174274