Hash Semi Join MapReduce to Join Billion Records in a Reasonable Time

Marwa Hussien Mohamed; Mohamed Helmy Khafagy; Mohamed Hasan Ibrahim

Abstract

Objective: MapReduce is a programming model for processing massive data sets, and analyzing big data is one of the most important problems today. Methods/Statistical Analysis: MapReduce is used to discover hidden patterns and relations in data through two simple functions, map and reduce, written by the programmer; the framework provides load balancing, fault tolerance and high scalability. Join is one of the most important operations in data analysis, but MapReduce does not support it directly. Findings: This paper explains the two-way MapReduce join algorithms semi-join and per-split semi-join, and proposes a new algorithm, hash semi-join, which uses a hash table to improve performance by eliminating unused records as early as possible. In the second phase, it applies the join through the hash table rather than using the map function to match the join key against the other data table. The hash table does not affect memory size, because only the matched records from the second table are stored. Our experimental results show that hash semi-join with a hash table outperforms the two other algorithms as the data size increases from 100 million to 50 billion records. Application/Improvements: With 30 machines running the experiments, running time increases with the number of joined records between the two tables, but our algorithm has a better running time than the other algorithms.
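
The idea described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and it does not use Hadoop: the function name `hash_semi_join`, the tuple-based row layout, and the sample tables are all assumptions made for illustration. It only shows the general pattern the abstract names: collect the join keys of the first table, keep only the matching records of the second table in a hash table, and then join by probing that hash table.

```python
from collections import defaultdict


def hash_semi_join(left, right, key_left=0, key_right=0):
    """Illustrative in-memory hash semi-join (hypothetical helper, not the paper's code).

    left, right: lists of tuples; key_left/key_right: index of the join key column.
    """
    # Phase 1: collect the distinct join keys that occur in the left table.
    left_keys = {row[key_left] for row in left}

    # Phase 2: scan the right table once and discard unmatched records early.
    # Only matched records are stored in the hash table, keyed by join key.
    matched = defaultdict(list)
    for row in right:
        if row[key_right] in left_keys:
            matched[row[key_right]].append(row)

    # Phase 3: probe the hash table with each left record to produce the join.
    joined = []
    for row in left:
        for other in matched.get(row[key_left], ()):
            joined.append(row + other)
    return joined


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (9, "cup")]
result = hash_semi_join(users, orders)
```

In this sketch the order with key 9 never enters the hash table, which mirrors the abstract's point that only matched records from the second table are kept in memory. In a real MapReduce job each phase would be distributed across map and reduce tasks; how the paper does that is not specified in the abstract.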

Keywords

Hadoop, Hash Semi Join, MapReduce, Two-Way Join

Indian Journal of Science and Technology
