
Normalization Techniques for Identifying Duplicate Records from Multiple Data Sources


Affiliations
1 Department of Computer Science and Engineering, Vivekanadha College of Engineering for Women, India
     



Abstract

This paper uses K-Nearest Neighbor (K-NN) as a supervised, web-scale forum crawler. The approach helps to identify whether the information in each forum is originally nested with the data it presents. It also helps to remove anonymous informative links from forum data, which avoids anonymous web usage and reduces the time spent crawling web pages. The goal is a systematic, novel implementation of deep web learning using K-NN, directed toward real-time information and its implications. A focused crawler for duplicate records in online information analyzes its crawl boundary to find the hyperlinks that are likely to be most relevant to the crawl, and avoids irrelevant regions of the web. It identifies the next most important and relevant link to follow by relying on probabilistic models to predict the relevance of the document. It can mine a group of duplicate records before selecting a value for an attribute of a normalized record. The overall performance of focused crawling for duplicate-record web pages depends on the richness of links within the specific topic the user is searching for. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. The paper shows how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page-type classifiers. Robust page-type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Test results show that K-NN achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages.
In addition, the results of applying K-NN to more than 100 community question-and-answer sites and blog sites demonstrated that the concept of an implicit navigation path could apply to other social media sites.
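
As an illustration of the reduction to URL-type recognition, the following is a minimal sketch of a K-NN classifier over URLs. It is not the authors' implementation: the tokenization, the Jaccard distance, the labels ("index", "thread", "other"), the example URLs, and the names `url_tokens`, `jaccard_distance`, `knn_classify`, and `TRAINING` are all assumptions made for this example.

```python
import re
from collections import Counter


def url_tokens(url):
    """Turn a URL into a set of lowercase tokens (hypothetical feature choice)."""
    path = re.sub(r"^https?://[^/]+", "", url.lower())  # drop scheme and host
    path = re.sub(r"\d+", "<num>", path)                # generalize numeric IDs
    return {t for t in re.split(r"[/?&=._-]+", path) if t}


def jaccard_distance(a, b):
    """Distance between two token sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_classify(url, training, k=3):
    """Label a URL by majority vote among its k nearest training URLs."""
    target = url_tokens(url)
    # sorted() is stable, so ties keep the order of the training list.
    ranked = sorted(training, key=lambda ex: jaccard_distance(target, url_tokens(ex[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical annotated URLs; a real training set would come from annotated forums.
TRAINING = [
    ("http://a.example/forumdisplay.php?f=3", "index"),
    ("http://a.example/forumdisplay.php?f=12", "index"),
    ("http://a.example/showthread.php?t=101", "thread"),
    ("http://a.example/showthread.php?t=202&page=2", "thread"),
    ("http://a.example/member.php?u=7", "other"),
    ("http://a.example/login.php", "other"),
]

if __name__ == "__main__":
    print(knn_classify("http://b.example/showthread.php?t=999", TRAINING))
    print(knn_classify("http://b.example/forumdisplay.php?f=44", TRAINING))
```

Replacing digit runs with a placeholder lets URLs from an unseen forum match training URLs that differ only in numeric identifiers, which is one plausible way to generalize across forums built on the same software package.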

Keywords

Web Learning, Neural Networks, Datasets, Regular Expression Patterns and Classifiers.

Authors

P. Abinaya
Department of Computer Science and Engineering, Vivekanadha College of Engineering for Women, India
R. Jayavadivel
Department of Computer Science and Engineering, Vivekanadha College of Engineering for Women, India
