
An Optimized Approach to Record Deduplication


Affiliations
1 Department of Computer Science, R.V.S College of Arts and Science, Sulur, Coimbatore, India
     



Record deduplication is a specialized technique for eliminating duplicate copies of repeated records. Duplicate record detection is an important part of data preprocessing and cleaning. The increasing volume of information available in digital media has become a challenging problem for data administrators, and this growth has also introduced redundant data into databases. A system or method for controlling redundancy and duplication has therefore become essential. Databases are growing in size at an exponential rate and play an important role in every industry. In the IT industry, detecting duplicate records is necessary to obtain precise search results and to reduce storage requirements. This paper presents the problem of duplicate records and their detection. The proposed approach uses the BAT algorithm to generate an optimal similarity measure, learned from training datasets, for deciding whether records are duplicates. The system is initialized with a population of random solutions and searches for optima by updating bat generations. We used synthetic datasets to analyze the proposed algorithm and compared its performance against the genetic programming technique using evaluation metrics. Our approach frees the user from the burden of having to choose and tune this parameter.

Keywords

BAT Algorithm, Data Preprocessing, Duplicate Detection, Data Duplication, Genetic Programming.
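
The abstract describes a population of random solutions that is refined over bat generations to find a similarity measure for flagging duplicates. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions that do not come from the paper: the similarity measure is a weighted average of per-field bigram Jaccard scores plus a decision threshold, fitness is the F1 score on labelled training pairs, the update is a simplified bat-style rule that pulls each bat toward the current best, and every name, parameter value, and training record is hypothetical.

```python
import math
import random


def field_similarity(a, b):
    """Jaccard similarity over character bigrams (assumed per-field measure)."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def f1_score(position, pairs):
    """Fitness of one bat: position = field weights followed by a threshold."""
    *weights, threshold = position
    total = sum(weights) or 1.0  # avoid division by zero for all-zero weights
    tp = fp = fn = 0
    for sims, is_dup in pairs:
        score = sum(w * s for w, s in zip(weights, sims)) / total
        predicted = score >= threshold
        if predicted and is_dup:
            tp += 1
        elif predicted:
            fp += 1
        elif is_dup:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def clamp(vec):
    """Keep every coordinate inside [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in vec]


def bat_optimize(fitness, dim, n_bats=20, n_gen=50, f_min=0.0, f_max=1.0,
                 alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Simplified bat-style search that maximizes `fitness` over [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats   # loudness shrinks as a bat accepts improvements
    rate = [0.0] * n_bats   # pulse rate grows toward r0 over the generations
    fit = [fitness(p) for p in pos]
    lead = max(range(n_bats), key=lambda i: fit[i])
    best, best_fit = pos[lead][:], fit[lead]
    for t in range(1, n_gen + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Random frequency scales the pull toward the current best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (b - x) * freq
                      for v, x, b in zip(vel[i], pos[i], best)]
            cand = clamp([x + v for x, v in zip(pos[i], vel[i])])
            if rng.random() > rate[i]:
                # Local random walk around the best solution
                # (the 0.1 step scale is an assumption of this sketch).
                cand = clamp([b + 0.1 * mean_loud * rng.uniform(-1, 1)
                              for b in best])
            cand_fit = fitness(cand)
            if rng.random() < loud[i] and cand_fit >= fit[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit


# Hypothetical labelled training pairs: (record_a, record_b, is_duplicate).
TRAIN = [
    (("John Smith", "12 Oak St"), ("Jon Smith", "12 Oak Street"), True),
    (("Mary Jones", "4 Elm Rd"), ("Mary Jones", "4 Elm Road"), True),
    (("John Smith", "12 Oak St"), ("Jane Smyth", "98 Pine Ave"), False),
    (("Mary Jones", "4 Elm Rd"), ("Mark Johnson", "7 Lake Dr"), False),
]
PAIRS = [([field_similarity(x, y) for x, y in zip(a, b)], dup)
         for a, b, dup in TRAIN]

best, best_f1 = bat_optimize(lambda p: f1_score(p, PAIRS), dim=3)
print("weights and threshold:", best, "training F1:", best_f1)
```

In this sketch the threshold is searched together with the field weights, which is one possible reading of how such an approach could spare the user from tuning a parameter by hand. Other encodings of the similarity measure would fit the same optimizer.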


Authors

V. Nirmala
Department of Computer Science, R.V.S College of Arts and Science, Sulur, Coimbatore, India
B. Rosiline Jeetha
Department of Computer Science, R.V.S College of Arts and Science, Sulur, Coimbatore, India
