
Improving Effectiveness in Large Scale Data by Concentrating Deduplication with Adaboost Algorithm


     



Deduplication refers to identifying duplicate or copied records by comparing one or more databases or data sets; matching records across multiple sources is known as record linkage. The output of deduplication is matched data that contains important, usable information. Because this information is costly to obtain, deduplication is attracting growing attention. Within the deduplication procedure, the cleaning step that removes duplicate data from a single database is difficult, since subsequent data integration or data mining tasks can be strongly affected by remaining duplicates. As database sizes keep growing, the matching step is becoming one of the major challenges for record linkage and deduplication. To address this issue, we propose T3S, a Two-Stage Sampling Selection model. In the first stage, T3S produces balanced subsets of candidate pairs that need to be labeled. In the second stage, it produces smaller and more informative training sets than the first stage by incrementally invoking an active selection that removes the redundant pairs created in the first stage; duplicate files are then identified in this stage using mnemonic names. In the classification phase, we extend the work by using the AdaBoost algorithm, an effective classification approach. Several studies have reported that AdaBoost gives better accuracy than an SVM classifier. Experiments with the proposed approach on a real-world dataset provide a comparative analysis of both methods and show that the proposed approach gives better results than SVM.
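As a rough illustration of the classification phase described above (not the authors' implementation), the sketch below compares AdaBoost and an SVM on pair-classification data using scikit-learn. It assumes each labeled candidate pair from the sampling stages has already been converted into a vector of per-field similarity scores with a duplicate / non-duplicate label; the data here is synthetic placeholder data.

```python
# Hedged sketch: AdaBoost vs. SVM for classifying candidate record pairs.
# Assumes similarity feature vectors (e.g. per-field string similarities)
# and binary labels already produced by the T3S sampling stages.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Placeholder data: 1000 candidate pairs, 4 similarity features each.
X = rng.random((1000, 4))                # e.g. Jaccard / edit-distance similarities
y = (X.mean(axis=1) > 0.6).astype(int)   # 1 = duplicate pair, 0 = non-duplicate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

print("AdaBoost F1:", f1_score(y_test, ada.predict(X_test)))
print("SVM F1:     ", f1_score(y_test, svm.predict(X_test)))
```

In practice the feature vectors, labels, and evaluation protocol would come from the labeled pairs selected by T3S rather than random placeholders.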


Keywords

Adaboost, Deduplication, Hashing Algorithm, SVM Classifier, Tokenization, T3S.