Normalization Techniques for Identifying Duplicate Records from Multiple Data Sources

P. Abinaya; R. Jayavadivel

Affiliations
1 Department of Computer Science and Engineering, Vivekanadha College of Engineering for Women, India

Abstract

This paper uses K-Nearest Neighbor (K-NN) as the basis of a supervised, web-scale forum crawler. The approach helps determine whether the information in each forum genuinely belongs with the data it presents. It also removes anonymous links from forum data, which avoids anonymous web usage and reduces the time users spend crawling web pages. The goal is a systematic, novel implementation of deep Web learning using K-NN that is oriented toward real-time information and its implications. A focused, online crawler for duplicate records analyzes its crawl frontier to find the hyperlinks most likely to be relevant to the crawl, and avoids irrelevant regions of the web. It selects the next most important and relevant link to follow by relying on probabilistic models that predict the relevance of the document. It can mine a group of duplicate records before selecting a value for an attribute of a normalized record. The overall performance of focused duplicate-record crawling depends on the richness of links within the specific topic the user is searching. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. The paper also shows how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that K-NN achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying K-NN to more than 100 community question-and-answer sites and blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
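
As a rough illustration of reducing forum crawling to URL-type recognition, the sketch below classifies a URL by majority vote among its nearest labeled neighbors. It is not the authors' implementation: the feature set in `url_features`, the labels `index`, `thread`, and `page-flipping`, the `example.com` URLs, and the choice of `k=3` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def url_features(url):
    """Map a URL to a small numeric vector (hypothetical features)."""
    path = url.split("://", 1)[-1]  # drop the scheme, keep host and path
    return (
        path.count("/"),                                        # path depth
        len(re.findall(r"\d+", path)),                          # digit runs (ids, page numbers)
        1.0 if "page" in path.lower() else 0.0,                 # paging marker
        1.0 if re.search(r"thread|topic", path, re.I) else 0.0,  # thread marker
        1.0 if re.search(r"forum|board", path, re.I) else 0.0,   # index marker
    )


def knn_predict(train, url, k=3):
    """Return the majority label among the k training URLs closest to `url`."""
    query = url_features(url)
    nearest = sorted(
        train, key=lambda item: math.dist(url_features(item[0]), query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Tiny hand-labeled training set (illustrative only, not data from the paper).
TRAIN = [
    ("http://example.com/forum/12", "index"),
    ("http://example.com/forum/7", "index"),
    ("http://example.com/board/3", "index"),
    ("http://example.com/thread/4521", "thread"),
    ("http://example.com/topic/88", "thread"),
    ("http://example.com/thread/901", "thread"),
    ("http://example.com/thread/4521/page/2", "page-flipping"),
    ("http://example.com/topic/88/page/3", "page-flipping"),
    ("http://example.com/forum/12/page/2", "page-flipping"),
]

if __name__ == "__main__":
    for candidate in (
        "http://example.com/thread/77",
        "http://example.com/forum/5",
        "http://example.com/thread/77/page/4",
    ):
        print(candidate, "->", knn_predict(TRAIN, candidate))
```

A crawler built along these lines could use the predicted URL type to decide which links to follow, for example following index and thread links while treating page-flipping links as continuations of a page it has already seen.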

Keywords

Web Learning, Neural Networks, Datasets, Regular Expression Patterns and Classifiers.

ICTACT Journal on Soft Computing
