
Discovering Informative Blocks from Web Pages for Efficient Information Extraction using DOM tree


Affiliations
1 Datta Meghe Institute of Engineering, Technology and Research, Sawangi, Wardha, India
 




Authors

Rakesh M. Kohale, Shreyash G. Balbudhe
Datta Meghe Institute of Engineering, Technology and Research, Sawangi, Wardha, India

Abstract


A web page generally contains its main data along with navigation panels, advertisements, and copyright and privacy notices. Apart from the main data, these elements carry no important information and can be called non-informative blocks. Because they are non-informative, they can distort the results of web data mining. To avoid this, it is important to separate the informative blocks (the main data) from the non-informative blocks of a web page. Within a website, non-informative blocks generally appear on many pages with the same format and the same content, whereas informative blocks differ from page to page in both content and format. Capturing this regularity requires a structure at the site level. The DOM tree is a page-level structure, and many tools are available to construct the DOM tree of a web page, but a DOM tree alone is not useful at the site level. We therefore need to construct a Site Style Tree (SST) for a website. By analyzing the SST, we can identify which parts of it are informative and which are non-informative. No tool is available to construct a style tree for a given website. This work aims to construct a style tree for a given website and to separate the informative blocks of the website from the non-informative ones.
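
As a rough illustration of the idea described above, the following Python sketch merges the DOM trees of several pages from one site into a single style-tree-like structure and flags blocks that occur on every page with identical content. It is a minimal sketch under stated assumptions, not the construction used in this work: the names `DomBuilder`, `StyleNode`, `merge`, `is_noise` and `classify_blocks` are hypothetical, children are matched by position and tag name only, and the all-or-nothing rule in `is_noise` is a simplification chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomNode:
    """One element of a page-level DOM tree."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = []


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree for a single page."""
    def __init__(self):
        super().__init__()
        self.root = DomNode("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:          # void elements never contain children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())     # normalize whitespace
        if text:
            self.stack[-1].text.append(text)


class StyleNode:
    """Site-level node: merges the same position across many pages."""
    def __init__(self, tag):
        self.tag = tag
        self.pages = 0                    # number of pages containing this node
        self.texts = defaultdict(int)     # distinct text -> number of pages
        self.children = {}                # (position, tag) -> StyleNode


def merge(style, dom):
    """Fold one page's DOM subtree into the site-level tree."""
    style.pages += 1
    style.texts[" ".join(dom.text)] += 1
    for i, child in enumerate(dom.children):
        key = (i, child.tag)
        if key not in style.children:
            style.children[key] = StyleNode(child.tag)
        merge(style.children[key], child)


def is_noise(style, total_pages):
    """Assumed rule: on every page, same text, and all descendants likewise."""
    if style.pages < total_pages or len(style.texts) > 1:
        return False
    return all(is_noise(c, total_pages) for c in style.children.values())


def classify_blocks(pages):
    """Return {path: label} for the outermost blocks that can be labeled."""
    root = StyleNode("#document")
    for html in pages:
        builder = DomBuilder()
        builder.feed(html)
        builder.close()
        merge(root, builder.root)

    labels = {}

    def walk(node, path):
        for key in sorted(node.children):
            child = node.children[key]
            child_path = f"{path}/{key[1]}[{key[0]}]"
            if is_noise(child, len(pages)):
                labels[child_path] = "non-informative"
            elif not child.children:
                labels[child_path] = "informative"
            else:
                walk(child, child_path)

    walk(root, "")
    return labels


if __name__ == "__main__":
    # Two toy pages sharing a navigation bar and a copyright notice.
    toy_pages = [
        "<html><body><div>Home | About</div><div><p>Article one</p></div>"
        "<div>Copyright notice</div></body></html>",
        "<html><body><div>Home | About</div><div><p>Article two</p></div>"
        "<div>Copyright notice</div></body></html>",
    ]
    for path, label in classify_blocks(toy_pages).items():
        print(path, label)
```

On the two toy pages, the shared navigation and copyright blocks should come out as non-informative, and the paragraph whose text differs between pages as informative. Matching children purely by position is fragile on real sites, so a practical version would likely need a more tolerant alignment of sibling nodes and a graded importance measure rather than a yes/no rule.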
