Automation of Template and Data Extraction from Dynamic Web Documents
Many websites contain large sets of pages generated by filling common templates with content. The irrelevant terms in these templates degrade the accuracy and performance of web applications, so template detection techniques have recently received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents. Such detection techniques are also used to avoid the duplication introduced by templates. In this paper, we present techniques that automatically produce clusters based on Minimum Description Length (MDL) cost, which can be used to extract search result records from dynamically generated web documents, and we extract the data from the clustered documents using the Template Table Text Chunk Removal (TTTCR) algorithm. Data extraction is the process of retrieving data from its source for further processing. Because clustering and template detection are performed together, no additional template extraction step is needed after clustering. Experimental results show that the proposed approach is feasible and effective for improving template and data extraction accuracy.
Keywords
Minimum Description Length (MDL), Template Extraction, Clustering, Template Table Text Chunk Removal (TTTCR).
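As a rough illustration of clustering driven by an MDL cost, the sketch below greedily merges documents whenever the merge shortens a two-part description (shared template terms plus per-document leftover terms). It is a minimal sketch under stated assumptions, not the method of the paper: the cost function, the representation of a document as a set of terms, and the names `mdl_cost` and `cluster_by_mdl` are all hypothetical.

```python
import math
from itertools import combinations


def mdl_cost(clusters, vocab_size):
    """Hypothetical two-part MDL cost in bits.

    Each cluster pays for its template (terms shared by all of its
    documents) once, and each document pays for its non-template terms.
    """
    bits_per_term = math.log2(vocab_size)
    cost = 0.0
    for docs in clusters:
        template = set.intersection(*docs)
        cost += len(template) * bits_per_term
        for doc in docs:
            cost += len(doc - template) * bits_per_term
    return cost


def cluster_by_mdl(docs):
    """Greedy agglomerative clustering: merge while the cost strictly drops."""
    vocab_size = max(2, len(set().union(*docs)))
    clusters = [[d] for d in docs]  # start with one cluster per document
    while len(clusters) > 1:
        current = mdl_cost(clusters, vocab_size)
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] + clusters[j])
            cost = mdl_cost(merged, vocab_size)
            if cost < current and (best is None or cost < best[0]):
                best = (cost, merged)
        if best is None:  # no merge shortens the description
            break
        clusters = best[1]
    return clusters


# Toy pages: two share a "home/login" template, two share "cart/pay".
pages = [
    {"home", "login", "a"},
    {"home", "login", "b"},
    {"cart", "pay", "c"},
    {"cart", "pay", "d"},
]
for group in cluster_by_mdl(pages):
    print(sorted(set.intersection(*group)))
```

In this toy setup the template of each final cluster falls out of the clustering itself as the intersection of its documents, which mirrors the idea that no separate template extraction pass is needed afterwards.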