
Enhancing HiveQL Engine Using Map-Join-Reduce


Affiliations
1 Pune Institute of Computer Technology College, Pune, Maharashtra, India
2 Department of Information Technology, PICT, Pune, India
3 PICT College of Engineering, Pune, India
     



Abstract

Hive is a data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also allows traditional MapReduce programmers to plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
This extensibility makes it possible to enhance the MapReduce model underlying HiveQL to Map-Join-Reduce, which motivates a detailed study of the resulting performance improvement.
In MapReduce, the programmer is only required to write specialized map and reduce functions as part of the job; the framework takes care of the rest. However, MapReduce suffers from a performance issue, mainly due to its sequential data processing strategy, which frequently checkpoints and shuffles intermediate results. MapReduce can therefore be improved to increase scalability and efficiency.
The proposed solution is Map-Join-Reduce, which removes the burden of presenting complex join algorithms to the system. We first propose a filter-join-aggregate mathematical model, which is an extension of the MapReduce model. To support this model, we present a Map-Join-Reduce architecture design for the HiveQL engine. This design sheds light on the query processing strategy of the Hive and Hadoop systems.
The benefit of this approach is that checkpointing and shuffling of intermediate results are minimized, which in turn improves system performance.
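
The abstract does not spell out the filter-join-aggregate model, so the following is only a minimal in-memory sketch of the general idea it names: filter the input, join it against a second dataset, and aggregate in one pass, without writing the joined rows out as an intermediate result. The function name `filter_join_aggregate`, the record fields (`id`, `customer_id`, `region`, `amount`), and the choice of a hash join with a sum are all assumptions made for illustration; this is plain Python, not Hive or Hadoop code, and not the authors' implementation.

```python
from collections import defaultdict


def filter_join_aggregate(orders, customers, order_pred, group_key, agg_value):
    """Filter orders, inner-join them to customers, and sum per group in one pass.

    Hypothetical illustration only: joined rows are consumed immediately by the
    aggregation step instead of being materialized between separate jobs.
    """
    # Build a hash index on the (assumed smaller) customers side.
    customers_by_id = {c["id"]: c for c in customers}

    totals = defaultdict(float)
    for order in orders:
        # Filter step: drop records that fail the predicate.
        if not order_pred(order):
            continue
        # Join step: look up the matching customer (inner join semantics).
        customer = customers_by_id.get(order["customer_id"])
        if customer is None:
            continue
        # Aggregate step: fold the joined record straight into its group total.
        totals[group_key(customer)] += agg_value(order)
    return dict(totals)


if __name__ == "__main__":
    customers = [
        {"id": 1, "region": "west"},
        {"id": 2, "region": "east"},
    ]
    orders = [
        {"customer_id": 1, "amount": 10.0},
        {"customer_id": 2, "amount": 5.0},
        {"customer_id": 1, "amount": 0.5},   # removed by the filter
        {"customer_id": 9, "amount": 99.0},  # no matching customer
    ]
    print(
        filter_join_aggregate(
            orders,
            customers,
            order_pred=lambda o: o["amount"] >= 1.0,
            group_key=lambda c: c["region"],
            agg_value=lambda o: o["amount"],
        )
    )
```

In a chained MapReduce plan, the join output would typically be written out and shuffled again before a second job aggregates it; the sketch folds both steps into one loop to show where such intermediate results could be avoided.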

Keywords

CPU and Memory Analysis, Hadoop, HiveQL.

Authors

Amruta Kulkarni
Pune Institute of Computer Technology College, Pune, Maharashtra, India
Shweta C. Dharmadhikari
Department of Information Technology, PICT, Pune, India
M. Emmanuel
PICT College of Engineering, Pune, India
