
Hash Semi Join MapReduce to Join Billion Records in a Reasonable Time




Authors

Marwa Hussien Mohamed
Department of Information Systems, Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt
Mohamed Helmy Khafagy
Department of Computer Science, Fayoum University, Cairo, Egypt
Mohamed Hasan Ibrahim
Department of Information Systems, Fayoum University, Cairo, Egypt

Abstract


Objective: MapReduce is a programming model used to process massive data sets, and analyzing such big data is one of today's most important challenges. Methods/Statistical Analysis: MapReduce is used to discover hidden patterns and relations in data and to extract more useful information through two simple programmer-written functions, map and reduce, while providing load balancing, fault tolerance and high scalability. The join is the most important operation in data analysis, but MapReduce does not support joins directly. Findings: This paper explains two two-way MapReduce join algorithms, semi-join and per-split semi-join, and proposes a new algorithm, hash semi-join, which uses a hash table to increase performance by eliminating unused records as early as possible and by applying the join through a hash-table lookup rather than using a map function to match the join key against the other table in the second phase. Using a hash table does not noticeably affect memory consumption because only the matched records from the second table are kept. Our experimental results show that the hash semi-join algorithm outperforms the other two algorithms as the data size grows from 100 million to 50 billion records. Application/Improvements: Running time increases with the number of joined records between the two tables when the workload is run on 30 machines, but our algorithm achieves a better running time than the other algorithms.
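
The early-filtering idea described above can be illustrated with a minimal sketch, assuming a Hadoop MapReduce job written in Java. The sketch supposes that the distinct join keys of the first table were extracted in an earlier phase and shipped to every node as a file (here called keys.txt, a hypothetical name); the mapper then probes an in-memory hash set so that unmatched records of the second table are discarded as early as possible and only matching records flow on to the join. This is only an illustrative approximation of the hash semi-join idea, not the authors' exact implementation.

// Minimal sketch of the map-side filtering step of a hash semi-join,
// assuming the Hadoop mapreduce API and a file "keys.txt" (hypothetical name)
// holding the distinct join keys of the first table, one per line,
// distributed to every task node.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashSemiJoinFilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory hash table of join keys from the first table.
    private final Set<String> joinKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the join keys once per task before any records are processed.
        try (BufferedReader reader = new BufferedReader(new FileReader("keys.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joinKeys.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assume comma-separated records whose first field is the join key.
        String[] fields = record.toString().split(",", 2);
        // Probe the hash table: only records of the second table whose key
        // matches are emitted, so unmatched records never reach the join phase
        // and only matched records need to be kept in memory or shuffled.
        if (fields.length == 2 && joinKeys.contains(fields[0])) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}

Because the probe is a constant-time hash lookup, the filtering cost stays proportional to the size of the second table rather than to the product of the two tables, which is what allows unused records to be dropped before the more expensive join phase.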

Keywords


Hadoop, Hash Semi Join, MapReduce, Two-Way Join



DOI: https://doi.org/10.17485/ijst/2018/v11i18/174274