Web Data Extraction and Alignment Tools: A Survey


Affiliations
1 Department of Computer Engineering, Pune Institute of Computer Engineering, Pune, India
 

A search engine generates a dynamic result page when a user submits a query. The result page contains query-relevant data along with auxiliary information such as advertisements and navigation panels. Deciding which part of the page holds the main content is easy for a human but difficult for a computer program. To make use of this data, the irrelevant parts must be removed and the relevant data extracted automatically from the result pages. The extracted data can then be aligned into a structured format, such as a table, for comparison.

This paper surveys automatic web data extraction and data alignment techniques. Web data extraction techniques fall into three main classes: wrapper programming languages, wrapper induction, and automatic extraction. Data alignment techniques rely either on the structure of HTML tags alone or on both tags and data values.
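
As a rough illustration of alignment that relies only on tag structure, the sketch below reads a toy result page, drops content outside the result records, and lines up the fields of each record in a table whose columns are tag paths. It is not one of the surveyed tools. The names RecordParser and align, the sample page, and the choice of <li> as the record boundary are all assumptions made for this example, and the parser assumes well-formed nesting with no void elements (such as <br>) inside a record.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Collect the text inside each record element, keyed by tag path."""

    def __init__(self, record_tag="li"):
        super().__init__()
        self.record_tag = record_tag  # assumed record boundary (hypothetical)
        self.path = []        # tags currently open inside the record
        self.current = None   # fields of the record being read, or None
        self.records = []

    def handle_starttag(self, tag, attrs):
        if self.current is not None:
            self.path.append(tag)
        elif tag == self.record_tag:
            self.current = {}

    def handle_endtag(self, tag):
        if self.current is None:
            return  # outside any record: navigation, ads, and so on
        if self.path:
            self.path.pop()  # assumes well-formed nesting inside a record
        elif tag == self.record_tag:
            self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            key = "/".join(self.path) or "text"
            self.current[key] = text


def align(records):
    """Align records into rows using tag paths as the column headers."""
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    rows = [[rec.get(col, "") for col in columns] for rec in records]
    return columns, rows


# Toy result page (made up for this sketch): two records plus noise.
PAGE = """
<div class="nav"><a href="/">Home</a></div>
<ul>
  <li><a href="/1">Book A</a><span>10 USD</span></li>
  <li><a href="/2">Book B</a><b>New</b><span>12 USD</span></li>
</ul>
<div class="ad">Buy now</div>
"""

parser = RecordParser()
parser.feed(PAGE)
parser.close()
columns, rows = align(parser.records)
print(columns)
for row in rows:
    print(row)
```

Here the second record has an extra <b> field, so the first record gets an empty cell in that column. A technique that also uses data values could additionally compare the cell contents (for example, recognizing that both <span> values are prices) before deciding that two fields belong in the same column.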


Keywords

Data Extraction, Wrapper Induction, DOM Tree, Web Crawler, Data Alignment.
Authors

Shridevi A. Swami
Department of Computer Engineering, Pune Institute of Computer Engineering, Pune, India
Pujashree Vidap
Department of Computer Engineering, Pune Institute of Computer Engineering, Pune, India
