
Machine Learning Approach for Unstructured Data Using Hive


Affiliations
1 Atria Institute of Technology, Bangalore, India
 

Voluminous structured, semi-structured, and unstructured data sets, from which relationships among business data can be learned, are being collected rapidly; these are termed big data. Storing such large volumes of data is difficult, because even traditional data warehousing solutions with terabytes or petabytes of capacity are insufficient and exorbitantly expensive.
It is viable to store and process these enormous amounts of data on Hadoop, a low-cost, reliable, scalable, and fault-tolerant Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop stores data in HDFS and implements the MapReduce programming model for processing large data sets with a parallel, distributed algorithm on commodity hardware. Nevertheless, the programming model expects developers to write bespoke programs that are inflexible, time-consuming, and hard to code, maintain, and reuse. This challenging task of writing complex MapReduce code is simplified by using HiveQL.
Hive is the platform required to run HiveQL. It is built on top of Hadoop to query big data. Internally, Hive queries are converted into corresponding MapReduce tasks.
In this paper, a movie rating prediction system is built on the MovieLens dataset using a machine learning algorithm.
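To illustrate what converting a query into a MapReduce task means, the following minimal Python sketch runs the three phases (map, shuffle, reduce) by hand for an aggregate that HiveQL would express in one statement. It is only an in-memory illustration, not Hive's actual execution plan; the `ratings` table name, the column names, and the sample rows are assumptions invented for this example.

```python
from collections import defaultdict

# HiveQL equivalent (table and column names are hypothetical):
#   SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id;

# Hypothetical (user_id, movie_id, rating) rows, invented for illustration.
RATINGS = [
    (1, 10, 4.0),
    (2, 10, 5.0),
    (1, 20, 3.0),
    (3, 20, 2.0),
    (3, 30, 5.0),
]


def map_phase(rows):
    """Emit one (movie_id, rating) key-value pair per input row."""
    for _user_id, movie_id, rating in rows:
        yield movie_id, rating


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of ratings into a single average."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def average_rating_per_movie(rows):
    """Chain the three phases over an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(average_rating_per_movie(RATINGS))
```

The single HiveQL statement in the comment replaces all three hand-written functions, which is the simplification the abstract attributes to Hive.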
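The keywords suggest that the algorithm in question is linear regression. As a rough sketch of that technique only, and not the authors' implementation, the following Python code fits a one-feature least-squares line and uses it to predict a rating. The choice of feature (a movie's average rating), the sample numbers, and the 1 to 5 clipping range are all assumptions made for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_rating(slope, intercept, x, low=1.0, high=5.0):
    """Predict a rating and clip it to an assumed [low, high] rating scale."""
    return min(high, max(low, slope * x + intercept))


if __name__ == "__main__":
    # Hypothetical training data: a movie's average rating (feature)
    # against the rating one user gave that movie (target).
    movie_averages = [2.0, 3.0, 4.0, 5.0]
    user_ratings = [1.5, 2.5, 3.5, 4.5]
    slope, intercept = fit_simple_linear_regression(movie_averages, user_ratings)
    print(predict_rating(slope, intercept, 3.5))
```

In a Hive-based pipeline, the sums that feed the slope and intercept could in principle be computed with aggregate queries over the ratings table, but the abstract does not say how the authors split the work between Hive and the learning step.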

Keywords

Big Data, HDFS, Hadoop, Hive, MapReduce, Linear Regression.

Authors

Neha Mangla
Atria Institute of Technology, Bangalore, India
Shanthi Mahesh
Atria Institute of Technology, Bangalore, India
M. Chhaya
Atria Institute of Technology, Bangalore, India
G. Vidyashree
Atria Institute of Technology, Bangalore, India
Vikas
Atria Institute of Technology, Bangalore, India
