
Automation of Template and Data Extraction from Dynamic Web Documents


Affiliations
1 Computer Science and Engineering, PSNA College of Engineering and Technology, Dindigul, India
2 Department of Computer Science and Engineering, PSNA College of Engineering and Technology, Dindigul, India
3 Department of Information Technology, PSNA College of Engineering and Technology, Dindigul, India
     



Abstract

Many websites contain large sets of pages generated by filling common templates with content. Because templates contain terms that are irrelevant to the page content, they degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention recently as a way to improve the performance of search engines and the clustering and classification of web documents, and to avoid duplicated template content. In this paper, we present techniques that automatically cluster documents based on Minimum Description Length (MDL) cost, so that search result records can be extracted from dynamically generated web documents, and we extract the data from the clustered documents using the Template Table Text Chunk Removal (TTTCR) algorithm. Data extraction is the process of retrieving data from its source for further processing. With this approach, no additional template extraction step is needed after clustering. Experimental results show that the proposed approach is feasible and effective in improving template and data extraction accuracy.

Keywords

Minimum Description Length (MDL), Template Extraction, Clustering, Template Table Text Chunk Removal (TTTCR).
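
As a rough illustration of clustering driven by an MDL cost, the sketch below treats each document as a set of tokens, defines a cluster's template as the tokens shared by all of its documents, and greedily merges clusters while the total description length drops. The token-set representation, the cost function, the greedy merge strategy, and all function names are assumptions made for this example; they are not taken from the paper, and the TTTCR algorithm is not reproduced here.

```python
from itertools import combinations


def mdl_cost(clusters):
    """Toy description length (an assumption, not the paper's cost model).

    Each cluster stores its template once (the tokens shared by all of its
    documents); each document then stores only the tokens outside the template.
    """
    total = 0
    for docs in clusters:
        template = frozenset.intersection(*docs)
        total += len(template)
        total += sum(len(doc - template) for doc in docs)
    return total


def cluster_by_mdl(documents):
    """Greedy agglomerative clustering: repeatedly apply the merge that
    lowers the total cost the most, and stop when no merge lowers it."""
    clusters = [[frozenset(doc)] for doc in documents]
    while len(clusters) > 1:
        current = mdl_cost(clusters)
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] + clusters[j])
            cost = mdl_cost(merged)
            if cost < current and (best is None or cost < best[0]):
                best = (cost, merged)
        if best is None:
            break
        clusters = best[1]
    return clusters


def split_template_and_data(cluster):
    """Return the shared template and the per-document leftover tokens."""
    template = frozenset.intersection(*cluster)
    return template, [doc - template for doc in cluster]


if __name__ == "__main__":
    # Hypothetical pages: two from a shop template, two from a forum template.
    pages = [
        {"html", "nav", "footer", "price", "book A"},
        {"html", "nav", "footer", "price", "book B"},
        {"html", "forum", "thread", "post 1"},
        {"html", "forum", "thread", "post 2"},
    ]
    for cluster in cluster_by_mdl(pages):
        template, data = split_template_and_data(cluster)
        print(sorted(template), [sorted(d) for d in data])
```

In this toy setting, once the clusters are formed the template of each cluster is already known as the shared token set, which mirrors the abstract's point that no separate template extraction pass is needed after clustering.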

Abstract Views: 273

PDF Views: 4





Authors

S. Pradeepa
Computer Science and Engineering, PSNA College of Engineering and Technology, Dindigul, India
K. Satheesbabu
Department of Computer Science and Engineering, PSNA College of Engineering and Technology, Dindigul, India
K. Sabeetha
Department of Information Technology, PSNA College of Engineering and Technology, Dindigul, India
