
A Survey on Removal of Duplicate Records in Database


Authors

S. Krishna Anand
School of Computing (CSE), SASTRA University, 613401, Thanjavur, Tamilnadu, India
 

Abstract

Deduplication is the task of identifying records in a repository that refer to the same real-world object or entity. The difficulty is that the same data may be represented differently in each database, so when databases are merged, duplicates arise from differing schemas, writing styles, or misspellings; such records are called replicas. Removing replicas from a repository yields higher-quality information and saves processing time. This paper presents a thorough analysis of similarity metrics for identifying similar fields in records, together with a set of algorithms and duplicate detection tools for detecting and removing replicas from a database.
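
As a rough illustration of the field-level similarity matching the survey analyses, the sketch below compares records attribute by attribute with a character-bigram Jaccard measure and flags pairs whose average similarity crosses a threshold. It is a minimal Python example; the field names, sample records, and the 0.6 threshold are illustrative assumptions, not taken from any of the surveyed tools.

# Minimal sketch (illustrative only): flagging possible replica records by
# comparing fields with a character-bigram Jaccard similarity.

def bigrams(text):
    """Character bigrams of a lower-cased, trimmed string."""
    t = text.lower().strip()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def jaccard(a, b):
    """Jaccard similarity of two bigram sets (1.0 when both strings are empty)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

def is_replica(rec1, rec2, fields, threshold=0.6):
    """Average the per-field similarities; pairs above the threshold are replicas."""
    score = sum(jaccard(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

records = [
    {"name": "S. Krishna Anand",   "city": "Thanjavur"},
    {"name": "Krishna Anand S",    "city": "Tanjavur"},   # reordering and misspelling
    {"name": "A. Different Person", "city": "Chennai"},
]

fields = ["name", "city"]
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if is_replica(records[i], records[j], fields):
            print("possible replicas:", records[i], records[j])

In practice, tools such as Febrl combine several such field-comparison functions with indexing (blocking) so that not every record pair has to be compared.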

Keywords

Similarity Metrics, Database, Indexing, Deduplication

References

  • Sarawagi S, and Bhamidipaty A (2002). Interactive deduplication using active learning, Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), 269-278.
  • Elfeky M G, Elmagarmid A K et al. (2002). TAILOR: a record linkage tool box, Proceedings of the International Conference on Data Engineering (ICDE ’02), 17-28.
  • Yancey W E (2002). Bigmatch: A program for extracting probable matches from a large file for record linkage, Technical Report Statistical Research Report Series RRC2002/01, US Bureau of the Census, Washington, D.C.
  • Bilenko M, and Mooney R J (2003). Adaptive duplicate detection using learnable string similarity measures, KDD ‘03 Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 39-48.
  • Elmagarmid A K, Ipeirotis P G et al. (2007). Duplicate record detection: a survey, IEEE Transactions on Knowledge and Data Engineering, vol 19(1), 1-16.
  • Freely extensible biomedical record linkage (Febrl) (2007). Available from: http://sourceforge.net/projects/febrl
  • Storer M W, Greenan K et al. (2008). Secure data deduplication, Storage SS ‘08 Proceedings of the 4th ACM international workshop on Storage security and survivability, 1-10.
  • de Carvalho M G, Laender A H F et al. (2008). Replica identification using genetic programming, SAC ’08 Proceedings of the 2008 ACM Symposium on Applied Computing, 1801-1806.
  • Wang W J, and Lochovsky F H (2010). Record matching over query results from multiple web databases, IEEE Transactions on Knowledge and Data Engineering, vol 22(4), 578-588.
  • Khan B, Rauf A et al. (2011). Identification and removal of duplicated records, World Applied Sciences Journal, vol 13(5), 1178-1184.
  • Christen P (2011). A survey of indexing techniques for scalable record linkage and deduplication, IEEE Transactions on Knowledge and Data Engineering, vol 24(9), 1537-1555.
  • de Carvalho M G, Laender A H F et al. (2012). A genetic programming approach to record deduplication, IEEE Transactions on Knowledge and Data Engineering, vol 24(3), 399-412.
  • Deepa K, and Rangarajan R (2012). Record deduplication using particle swarm optimization, European Journal of Scientific Research, vol 80(3), 366-378.
  • Subramaniyaswamy V, and Pandian S C (2012). A complete survey of duplicate record detection using data mining techniques, Information Technology Journal, vol 11(8), 941-945.
  • Banu A F, and Chandrasekar C (2012). A survey on deduplication methods, International Journal of Computer Trends and Technology, vol 3(3), 364-368.
  • Shanmugavadivu P, and Baskar N (2012). An improved genetic programming based approach to deduplication using KFINDMR, International Journal of Computer Trends and Technology, vol 3(5), 694-701.



DOI: https://doi.org/10.17485/ijst/2013/v6i4/31858