Web Forum Crawling Using Index Thread Page Flipping Algorithm
Internet forums are important platforms where users can post requests and exchange information from different sources. Existing forum crawlers suffer from the URL type recognition problem, which causes them to fetch duplicate links and uninformative pages. The Index Thread Page Flipping (ITF) algorithm is used to overcome this issue. URL layout and page layout are used to recognise whether a URL is valid or invalid.
In this project (Phase-I), "Web Forum Crawling using Index Thread Page Flipping Algorithm" is presented, which determines whether links are valid or invalid. The goal is to crawl only relevant content. Internet forums pose the URL type recognition problem. The crawler learns the correct path or URL by using regular expression patterns together with training sets created from page type classifiers.
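As an illustration of URL type recognition with regular expression patterns, the following minimal sketch classifies a URL path as index, thread, page-flipping, or other. The class name, the enum, and the three patterns are assumptions made for this example (a hypothetical forum whose boards live under /forum/ and whose threads live under /thread/); in the approach described above, the patterns would be learned per site from the training sets rather than written by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the project: maps a URL path to a URL type.
class UrlTypeClassifier {
    enum UrlType { INDEX, THREAD, PAGE_FLIPPING, OTHER }

    // Assumed, hand-written patterns for an imaginary forum layout.
    private static final Map<UrlType, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put(UrlType.INDEX, Pattern.compile("/forum/\\d+"));
        PATTERNS.put(UrlType.THREAD, Pattern.compile("/thread/\\d+"));
        PATTERNS.put(UrlType.PAGE_FLIPPING, Pattern.compile("/(forum|thread)/\\d+\\?page=\\d+"));
    }

    // Returns the first type whose pattern matches the whole path, or OTHER if none does.
    static UrlType classify(String path) {
        for (Map.Entry<UrlType, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(path).matches()) {
                return entry.getKey();
            }
        }
        return UrlType.OTHER;
    }

    public static void main(String[] args) {
        String[] samples = { "/forum/12", "/thread/345", "/thread/345?page=2", "/login" };
        for (String path : samples) {
            System.out.println(path + " -> " + classify(path));
        }
    }
}
```

Paths classified as OTHER (for example login or profile pages) are the ones such a crawler could treat as uninformative and skip.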
The modules implemented are the user interface design module, the page flipping module, the entry URL discovery module, the index/thread URL detection module, and the generic crawler module. In the user interface design module, users must supply their user name and password to connect to the server. In the page flipping module, a long forum is divided into multiple pages that are linked by page-flipping links. Generic crawlers process each page individually and ignore the relationships between such pages. In the entry URL discovery module, the entry URL must be specified before processing can begin, and a set of rules is defined to find it. In the index and thread URL detection module, index URLs and thread URLs are identified by their URL patterns. In the generic crawler module, the crawler is given a forum, enters its thread pages, and crawls them while avoiding duplicate links and page-flipping links.
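The relationship between pages joined by page-flipping links, and the removal of duplicate links, can be sketched as follows. This is only an illustrative example under an assumed URL layout in which /thread/<id> is the first page of a thread and /thread/<id>?page=<n> is page n; the class name, method name, and pattern are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups the pages of each thread and drops duplicate links.
class ThreadPageGrouper {
    // Assumed layout: /thread/<id> is page 1, /thread/<id>?page=<n> is page n.
    private static final Pattern THREAD_PAGE =
            Pattern.compile("/thread/(\\d+)(?:\\?page=(\\d{1,9}))?");

    // Returns thread id -> sorted page numbers; non-thread paths are ignored.
    static Map<String, List<Integer>> group(List<String> paths) {
        Map<String, TreeSet<Integer>> pagesByThread = new TreeMap<>();
        for (String path : paths) {
            Matcher m = THREAD_PAGE.matcher(path);
            if (!m.matches()) {
                continue; // not a thread page under the assumed layout
            }
            int page = (m.group(2) == null) ? 1 : Integer.parseInt(m.group(2));
            // The TreeSet keeps pages sorted and silently absorbs duplicate links.
            pagesByThread.computeIfAbsent(m.group(1), id -> new TreeSet<>()).add(page);
        }
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, TreeSet<Integer>> entry : pagesByThread.entrySet()) {
            result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return result;
    }
}
```

Grouping by thread id is what lets a crawler treat the pages of one thread as a single unit instead of as unrelated pages, which is the relationship a generic crawler ignores.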
The front end for all the modules in the project (Phase-I) is designed using Eclipse, and the back end uses SQL Server 2005. The two modules in the project (Phase-I) are implemented using Java Servlets and JSP, with the code-behind written in Java. The main benefit of this project (Phase-I) is that it saves bandwidth and time.