
An Efficient Web Content Mining Using Divide and Conquer Approach


Affiliations
1 Department of C.S.E., Sreenivasa Institute of Technology and Management Studies, Chittoor, Andhra Pradesh, India
2 Department of Computer Science & Engineering, Bundelkhand University, Jhansi, Uttar Pradesh, India
     



In web mining, large collections of web documents are processed to extract useful information. Most web information is irrelevant: a typical web document presents only 10-15% data, and the remaining 85-90% is tags. Previous researchers have proposed many methods for mining web documents, but all of these methods process documents without considering document size. When N documents of differing sizes are mined this way, the process is time-consuming. In this thesis we propose a new web mining method called Web Mining using Divide and Conquer Approach (WMDCA). It consists of four phases: a document selection phase (the list of documents is selected), a preprocessing phase (large documents are divided, each document is cleaned, and all sub-documents are combined to create an XML cube), a web mining phase (our algorithm is applied to identify patterns), and a presentation phase (the discovered results are presented). Experiments were conducted on various web documents related to a single domain. The experimental results show that the proposed system produces patterns in less time than existing web document mining methods.
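The preprocessing and mining phases described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no size threshold, no XML cube schema, and no details of the mining algorithm, so `MAX_CHARS`, the `<cube>/<document>/<part>` layout, and every function name here are hypothetical. The mining step is a plain document-frequency count used as a stand-in; it is not the EFKS algorithm named in the keywords.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser

MAX_CHARS = 2000  # hypothetical size threshold; the abstract does not give one


class TextExtractor(HTMLParser):
    """Keeps text nodes; drops tags and the contents of script/style."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def clean(html):
    """Cleaning step: strip markup and return the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def text_share(html):
    """Fraction of a document's characters that survive cleaning."""
    return len(clean(html)) / len(html) if html else 0.0


def divide(html, max_chars=MAX_CHARS):
    """Divide step: recursively halve a document until each piece is small.

    Cuts are made just before a '<' so that no tag is split in two. This
    assumes '<' only ever starts a tag, which inline scripts can violate.
    """
    if len(html) <= max_chars:
        return [html]
    cut = html.find("<", len(html) // 2)
    if cut <= 0:  # no tag boundary after the midpoint: keep the piece whole
        return [html]
    return divide(html[:cut], max_chars) + divide(html[cut:], max_chars)


def build_cube(documents, max_chars=MAX_CHARS):
    """Combine step: one <document> per input, one <part> per cleaned piece.

    The element layout is an assumed stand-in for the paper's XML cube.
    """
    cube = ET.Element("cube")
    for name, html in documents.items():
        doc_el = ET.SubElement(cube, "document", name=name)
        for piece in divide(html, max_chars):
            text = clean(piece)
            if text:
                ET.SubElement(doc_el, "part").text = text
    return cube


def mine(cube, min_docs=2):
    """Placeholder mining step: terms found in at least min_docs documents."""
    doc_freq = Counter()
    for doc_el in cube.iter("document"):
        words = set()
        for part in doc_el.iter("part"):
            words.update(re.findall(r"[a-z]+", part.text.lower()))
        doc_freq.update(words)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


def wmdca_sketch(documents, max_chars=MAX_CHARS, min_docs=2):
    """Runs preprocessing and mining on an already-selected document set."""
    return mine(build_cube(documents, max_chars), min_docs)
```

Here `documents` is a dictionary mapping a document name to its HTML source, which stands in for the document selection phase; printing or otherwise displaying the list returned by `wmdca_sketch` would correspond to the presentation phase. `text_share` gives a rough way to check the data-to-tag proportion on a given page.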

Keywords

Divide and Conquer Method, Document Cleaning, Filtering, EFKS Algorithm, Web Mining








Authors

M. Giri
Department of C.S.E., Sreenivasa Institute of Technology and Management Studies, Chittoor, Andhra Pradesh, India
Akash Kumar
Department of Computer Science & Engineering, Bundelkhand University, Jhansi, Uttar Pradesh, India

