
Identification of Duplicate Records Over Query Results from Real Time Web Databases


Affiliations
1 B. S. Abdur Rahman University, Chennai, India
2 SRM University, Chennai, India
     



Detecting database records that are approximate duplicates is an important task. When a database is assembled from millions of records drawn from other sources, unintentional duplication can hardly be avoided. Databases may contain duplicate records that represent the same real-world entity because of data entry errors, abbreviations, and differences in the detailed schemas of records from multiple databases. Current duplicate detection techniques are supervised methods, which require training data. These methods are not applicable to the real-time database scenario, where the records to match are query results generated dynamically on the fly. To address record matching in this scenario, we present Unsupervised Duplication Detection (UDD), an algorithm that, for a given query, can effectively identify duplicates among the query result records of multiple databases. Starting from the non-duplicate set, the proposed algorithm uses a weighted component similarity summing classifier and an OSVM classifier to iteratively identify duplicates in the query results from multiple databases.
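
As a rough illustration of the weighted component similarity summing idea mentioned above, the following minimal Python sketch scores a pair of records by summing per-field similarities multiplied by field weights. It is not the authors' implementation: the field names, the weights, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the field similarity measure are all assumptions made for this example, and the iterative weight adjustment and the OSVM classifier are not shown.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (assumed measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities, normalized by the total weight."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total


def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Flag a pair as duplicate when its score reaches the (hypothetical) threshold."""
    return weighted_similarity(r1, r2, weights) >= threshold


# Hypothetical query result records from two different databases.
rec_a = {"title": "Data Mining", "author": "J. Smith"}
rec_b = {"title": "data mining ", "author": "J. Smith"}
rec_c = {"title": "xyz", "author": "qqq"}
weights = {"title": 0.7, "author": 0.3}  # assumed, not taken from the paper

print(is_duplicate(rec_a, rec_b, weights))
print(is_duplicate(rec_a, rec_c, weights))
```

In this sketch the weights are fixed by hand; an unsupervised method would have to derive them from the data itself, for example from the records assumed to be non-duplicates.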

Keywords

Record Matching, Duplication Detection, SVM, UDD.

Authors

J. Aruna
B. S. Abdur Rahman University, Chennai, India
J. Jeysree
SRM University, Chennai, India
