
Reusing Values in Big Data Frameworks


Affiliations
1 Department of Computer Applications, Muthayammal College of Arts & Science, Namakkal, Tamilnadu, India
     



Big Data has been a very active research area over the past few years. It is becoming hard to execute data analysis tasks efficiently with traditional data warehouse solutions. Parallel processing platforms, and the dataflow systems that run on top of them, are increasingly popular. They have greatly improved the throughput of data analysis tasks. The trade-off is that they consume more computation resources: tens or hundreds of nodes run together to execute one task, yet a task might still take hours or even days to complete. It is therefore very important to improve resource utilization and computation efficiency. According to research conducted by Microsoft, common sub-computations account for around 40% of typical workloads. Such computation redundancy is a waste of time and resources. Apache Pig is a parallel dataflow system that runs on top of Apache Hadoop, which is a parallel processing platform. Pig/Hadoop is one of the most popular combinations used for large-scale data processing. This thesis project proposed a framework on top of Pig/Hadoop that materializes and reuses previous computation results to avoid computation redundancy. The idea came from the materialized view technique in relational databases. Computation outputs were selected and stored in the Hadoop Distributed File System (HDFS) because of their large size. The execution statistics of the outputs were stored in MySQL Cluster. The framework used a plan matcher and rewriter module to find, in MySQL Cluster, the largest computation shared with the query, and to rewrite the query to use the materialized outputs. The framework was evaluated with the TPC-H benchmark. The results showed that execution time was reduced significantly by avoiding redundant computation. Reusing sub-computations reduced query execution time by 85% on average, and queries took only around 40 to 55 seconds when whole computations were reused. In addition, the results showed that the overhead was only around 35% on average.
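
The matching and rewriting step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the tuple-based plan representation, the function names, and the HDFS-style path are all hypothetical, and a plain Python dictionary stands in for the metadata that the framework keeps in MySQL Cluster. The sketch fingerprints every sub-plan and, walking top-down, replaces the largest sub-plan whose fingerprint is already known with a load of its stored output.

```python
import hashlib

# Hypothetical plan representation: (operator, argument, *child_plans).
# Example: ("FILTER", "l_quantity > 10", ("LOAD", "lineitem"))


def fingerprint(plan):
    """Return a stable hash identifying a plan by operator, argument, and inputs."""
    op, arg, *children = plan
    child_hashes = ",".join(fingerprint(child) for child in children)
    text = f"{op}|{arg}|{child_hashes}"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def rewrite(plan, repository):
    """Replace materialized sub-plans with a LOAD of their stored output.

    The walk is top-down, so on each branch the largest matching sub-plan
    is replaced and its descendants are never visited.
    """
    path = repository.get(fingerprint(plan))
    if path is not None:
        return ("LOAD", path)
    op, arg, *children = plan
    return (op, arg, *(rewrite(child, repository) for child in children))


# An earlier job whose output is assumed to have been materialized.
earlier = ("FILTER", "l_quantity > 10", ("LOAD", "lineitem"))

# Stand-in for the metadata store: fingerprint -> hypothetical output path.
repository = {fingerprint(earlier): "/materialized/out_001"}

# A new query that shares the earlier job as a sub-computation.
query = ("GROUP", "l_orderkey", earlier)
rewritten = rewrite(query, repository)

# Resubmitting the earlier job itself reuses the whole computation.
whole = rewrite(earlier, repository)
```

In this sketch, `rewritten` keeps the `GROUP` operator but reads the filtered rows from the stored output, while `whole` collapses to a single load. Exact fingerprint matching only detects identical sub-plans; recognizing equivalent plans written differently would need plan normalization, which is outside the scope of this sketch.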

Keywords

Hadoop, Pig, MapReduce, HDFS, Cluster.

Authors

P. Surya
Department of Computer Applications, Muthayammal College of Arts & Science, Namakkal, Tamilnadu, India
M. Abivarsha
Department of Computer Applications, Muthayammal College of Arts & Science, Namakkal, Tamilnadu, India
S. Gopinath
Department of Computer Applications, Muthayammal College of Arts & Science, Namakkal, Tamilnadu, India
