
A Survey on Removal of Duplicate Records in Database


Authors

S. Krishna Anand
School of Computing (CSE), SASTRA University, 613401, Thanjavur, Tamilnadu, India
 

Abstract

Deduplication is the task of identifying records in a repository that refer to the same real-world object or entity. The difficulty is that the same data may be represented differently in each database, so when databases are merged, duplicates arise from differing schemas, writing styles, or misspellings; such records are called replicas. Removing replicas from a repository yields higher-quality information and saves processing time. This paper presents a thorough analysis of similarity metrics for identifying similar fields in records, together with a set of algorithms and duplicate detection tools for detecting and removing replicas from a database.
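
As a rough illustration of the field-level similarity matching the survey analyses, the sketch below compares records attribute by attribute with a character-bigram Jaccard measure and flags pairs whose average similarity crosses a threshold. It is a minimal Python example; the field names, sample records, and the 0.6 threshold are illustrative assumptions, not taken from any of the surveyed tools.

# Minimal sketch (illustrative only): flagging possible replica records by
# comparing fields with a character-bigram Jaccard similarity.

def bigrams(text):
    """Character bigrams of a lower-cased, trimmed string."""
    t = text.lower().strip()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def jaccard(a, b):
    """Jaccard similarity of two bigram sets (1.0 when both strings are empty)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

def is_replica(rec1, rec2, fields, threshold=0.6):
    """Average the per-field similarities; pairs above the threshold are replicas."""
    score = sum(jaccard(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

records = [
    {"name": "S. Krishna Anand",   "city": "Thanjavur"},
    {"name": "Krishna Anand S",    "city": "Tanjavur"},   # reordering and misspelling
    {"name": "A. Different Person", "city": "Chennai"},
]

fields = ["name", "city"]
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if is_replica(records[i], records[j], fields):
            print("possible replicas:", records[i], records[j])

In practice, tools such as Febrl combine several such field-comparison functions with indexing (blocking) so that not every record pair has to be compared.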

Keywords

Similarity Metrics, Database, Indexing, Deduplication

References

  • Sarawagi S, and Bhamidipaty A (2002). Interactive deduplication using active learning, Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), 269-278.
  • Elfeky M G, Elmagarmid A K et al. (2002). TAILOR: a record linkage tool box, Proceedings of the International Conference on Data Engineering (ICDE ’02), 17-28.
  • Yancey W E (2002). Bigmatch: A program for extracting probable matches from a large file for record linkage, Technical Report Statistical Research Report Series RRC2002/01, US Bureau of the Census, Washington, D.C.
  • Bilenko M, and Mooney R J (2003). Adaptive duplicate detection using learnable string similarity measures, KDD ‘03 Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 39-48.
  • Elmagarmid A K, Ipeirotis P G et al. (2007). Duplicate record detection: a survey, IEEE Transactions on Knowledge and Data Engineering, vol 19(1), 1-16.
  • Freely extensible biomedical record linkage (Febrl) (2007). Available from: http://sourceforge.net/projects/febrl
  • Storer M W, Greenan K et al. (2008). Secure data deduplication, Storage SS ‘08 Proceedings of the 4th ACM international workshop on Storage security and survivability, 1-10.
  • de Carvalho M G, Laender A H F et al. (2008). Replica identification using genetic programming, SAC ’08 Proceedings of the 2008 ACM Symposium on Applied Computing, 1801-1806.
  • Wang W J, and Lochovsky F H (2010). Record matching over query results from multiple web databases, IEEE Transactions on Knowledge and Data Engineering, vol 22(4), 578-588.
  • Khan B, Rauf A et al. (2011). Identification and removal of duplicated records, World Applied Sciences Journal, vol 13(5), 1178-1184.
  • Christen P (2011). A survey of indexing techniques for scalable record linkage and deduplication, IEEE Transactions on Knowledge and Data Engineering, vol 24(9), 1537-1555.
  • de Carvalho M G, Laender A H F et al. (2012). A genetic programming approach to record deduplication, IEEE Transactions on Knowledge and Data Engineering, vol 24(3), 399-412.
  • Deepa K, and Rangarajan R (2012). Record deduplication using particle swarm optimization, European Journal of Scientific Research, vol 80(3), 366-378.
  • Subramaniyaswamy V, and Pandian S C (2012). A complete survey of duplicate record detection using data mining techniques, Information Technology Journal, vol 11(8), 941-945.
  • Banu A F, and Chandrasekar C (2012). A survey on deduplication methods, International Journal of Computer Trends and Technology, vol 3(3), 364-368.
  • Shanmugavadivu P, and Baskar N (2012). An improved genetic programming based approach to deduplication using KFINDMR, International Journal of Computer Trends and Technology, vol 3(5), 694-701.



DOI: https://doi.org/10.17485/ijst/2013/v6i4/31858