Identification of Duplicate Records Over Query Results from Real Time Web Databases
Detecting database records that are approximate duplicates is an important task. When a database is assembled from millions of records drawn from other sources, unintentional duplication can hardly be avoided. Databases may contain duplicate records that represent the same real-world entity because of data entry errors, abbreviations, or differences in the detailed schemas of records from multiple databases. Current duplicate detection techniques are supervised and therefore require training data. These methods are not applicable to the real-time web database scenario, where the records to match are query results generated dynamically on the fly. To address record matching in this scenario, we present Unsupervised Duplication Detection (UDD), an algorithm that, for a given query, can effectively identify duplicates among the query result records of multiple databases. Starting from the non-duplicate set, the proposed algorithm uses a weighted component similarity summing classifier and an OSVM classifier to iteratively identify duplicates in the query results from multiple databases.
Keywords
Record Matching, Duplication Detection, SVM, UDD.
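The weighted component similarity summing step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token-based Jaccard field similarity, the fixed field weights, the 0.8 threshold, and the function names `field_similarity`, `weighted_similarity`, and `find_duplicates` are all assumptions made for illustration. The OSVM classifier and the iterative refinement from the non-duplicate set are omitted.

```python
from typing import List, Sequence, Tuple

Record = Sequence[str]  # one string per field, same field order in every source


def field_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (assumed field measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def weighted_similarity(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Sum the per-field similarities, each scaled by its (assumed) weight."""
    if not (len(r1) == len(r2) == len(weights)):
        raise ValueError("records and weights must have the same number of fields")
    total = sum(weights)
    score = sum(w * field_similarity(x, y) for w, x, y in zip(weights, r1, r2))
    return score / total  # normalise so the result lies in [0, 1]


def find_duplicates(
    source_a: Sequence[Record],
    source_b: Sequence[Record],
    weights: Sequence[float],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) whose weighted similarity meets the threshold."""
    return [
        (i, j)
        for i, rec_a in enumerate(source_a)
        for j, rec_b in enumerate(source_b)
        if weighted_similarity(rec_a, rec_b, weights) >= threshold
    ]
```

Called on the result lists of two sources with, for example, weights `(0.7, 0.3)` for a title field and an author field, `find_duplicates` returns the cross-source record pairs that score at or above the threshold. In an unsupervised setting such as the one the abstract describes, the weights would be adjusted from the data instead of being fixed by hand as they are here.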