Author Details

Data is now being generated at a tremendous rate, so handling it to gain useful insight has become a present-day necessity; such data can help researchers and organizations perform analysis. Traditional systems cannot handle more than terabytes of data, since performance degrades and storage is very costly. Big data is an innovative technique to analyze, store, manage, distribute and capture datasets. To achieve compressed storage, we implement a parallel mining algorithm called Enhancement of Parallel Mining using Hadoop (EPH). Hadoop is a platform that enables distributed processing using the MapReduce programming model. This yields results quickly, and faster results help a business compete and grow. For the analysis in this paper, unstructured real-time datasets are converted to a structured format and processed in MapReduce. The literature shows that existing mining algorithms for real-time datasets lack fault tolerance, load balancing, data distribution and automatic parallelization. To overcome these disadvantages, we implement MapReduce for association analysis. In EPH, we improve performance by distributing the load across the computing nodes. Our proposed solution uses real-world celestial spectral data. A graphical comparison of the traditional system with Hadoop is presented in this paper.
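The idea of using MapReduce for association analysis can be illustrated with a minimal single-machine sketch in Python. This is not the EPH algorithm itself, whose details are not given in the abstract; it only shows the general pattern of counting itemset support with a map step, a grouping step and a reduce step. The function names, the toy transactions and the support threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Emit (itemset, 1) for every single item and every item pair."""
    items = sorted(set(transaction))
    for item in items:
        yield (item,), 1
    for pair in combinations(items, 2):
        yield pair, 1


def shuffle(mapped):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the partial counts for one itemset."""
    return key, sum(values)


def frequent_itemsets(transactions, min_support):
    """Return itemsets (size 1 and 2) whose count is at least min_support."""
    mapped = (kv for t in transactions for kv in map_phase(t))
    grouped = shuffle(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    return {k: c for k, c in counts.items() if c >= min_support}


# Hypothetical toy transactions, not the celestial spectral data.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = frequent_itemsets(transactions, min_support=3)
```

On a real cluster the map calls would run on different nodes over separate data splits, which is where the load distribution described above would come from.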

Keywords

Big Data, Hadoop, MapReduce, Parallel Mining, Association Analysis, Enhancement of Parallel Mining using Hadoop (EPH).

Machine Learning Approach for Unstructured Data Using Hive

Authors

Neha Mangla ¹, Shanthi Mahesh ¹, M. Chhaya ¹, G. Vidyashree ¹, Vikas ¹

Affiliations
1 Atria Institute of Technology, Bangalore, IN

Source

International Journal of Engineering Research, Vol 5, No SP 4 (2016), Pagination: 801-807

Abstract

Voluminous structured, semi-structured and unstructured data sets, which have the potential to reveal relationships among data in the business domain, are being collected rapidly; these are termed big data. Storing such large volumes of data is difficult, as even traditional data warehousing solutions with terabytes or petabytes of capacity are insufficient and exorbitantly expensive.
It is viable to store and process these large amounts of data on Hadoop, a low-cost, reliable, scalable and fault-tolerant Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop implements the MapReduce programming model for processing large data sets with a parallel, distributed algorithm on commodity hardware. Nevertheless, the programming model expects developers to write bespoke programs that are inflexible, time-consuming, and hard to code, maintain and reuse. This challenging task of writing complex MapReduce code is simplified by using HiveQL.
Hive is the platform required to run HiveQL. It is built on top of Hadoop to query big data. Internally, Hive queries are converted into corresponding MapReduce tasks.
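To make the query-to-MapReduce conversion concrete, consider a grouped aggregate such as the average rating per movie. The Python sketch below shows, conceptually, the map and reduce steps such a query corresponds to. It is a simplified illustration, not Hive's actual execution plan, and the row layout `(movie_id, rating)` and the function names are assumptions.

```python
from collections import defaultdict


def map_rating(row):
    """Map step: emit the grouping key and the value to aggregate."""
    movie_id, rating = row
    return movie_id, rating


def average_rating_by_movie(rows):
    """Group mapped values by key, then reduce each group to its mean."""
    grouped = defaultdict(list)
    for row in rows:
        key, value = map_rating(row)
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Hypothetical rows in the form (movie_id, rating).
rows = [(1, 4.0), (1, 5.0), (2, 3.0)]
averages = average_rating_by_movie(rows)
```

Writing the aggregate as a query spares the developer from hand-coding these two steps for every analysis.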
In this paper, a movie rating prediction system is built on the MovieLens dataset using a machine learning algorithm.
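The keywords name linear regression as the learning technique. The following is a minimal sketch of one-feature least-squares regression in Python, assuming a single hypothetical predictor (for example, a user's mean past rating) and a 1 to 5 star scale. The abstract does not state which features or which tooling the authors used, so every name and value here is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_rating(slope, intercept, x, low=1.0, high=5.0):
    """Predict a rating and clip it to the assumed rating scale."""
    return min(high, max(low, slope * x + intercept))


# Hypothetical training data: feature value and observed rating.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.5, 3.5, 4.5]
slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = predict_rating(slope, intercept, 3.0)
```

Clipping keeps predictions inside the valid rating range, which a plain linear model does not guarantee on its own.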

Keywords

Big Data, HDFS, Hadoop, Hive, MapReduce, Linear Regression.