
Component Based Effective Web Crawler and Indexer Using Web Services


Affiliations
1 Multimedia Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamilnadu, India
     



Abstract

Designing and developing an effective web crawler is a challenging task in building a large search engine. This paper proposes a component-based web crawler together with an indexer. The system consists of crawler services and indexer services, both realized as web services, which communicate by exchanging messages using XML, SOAP and WSDL. In the crawler service, web pages are fetched and parsed to retrieve all of their hyperlinks, and the process is repeated recursively using a breadth-first strategy. The pages at the extracted URLs are downloaded and sent to the indexer service as messages. In the indexer service, the HTML pages are parsed, stop words are removed and keywords are stemmed as pre-processing steps, and the result is stored in the form of an inverted index. We have evaluated the performance of the proposed crawler and indexer design and found that the number of pages retrieved is notably high.

Keywords

Inverted Index, Tokenization, URL, Web Crawler, Web Service.
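
The pipeline summarized in the abstract (breadth-first link extraction followed by stop-word removal, stemming and inverted-index construction) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several things here are assumptions made purely for illustration: the in-memory `PAGES` dictionary stands in for real HTTP fetches, the `STOP_WORDS` set is a tiny sample, `stem` is a toy suffix stripper rather than a real stemming algorithm, and the crawler hands pages to the indexer through a direct function call instead of the SOAP messages the paper describes.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import re

# Hypothetical in-memory "web" (URL -> HTML); stands in for real HTTP fetches.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a> crawling the web',
    "http://example.org/a": '<a href="/b">B</a> crawlers index pages',
    "http://example.org/b": '<a href="/">home</a> indexing and searching',
}

# Tiny sample stop-word list (assumption; the paper does not give its list).
STOP_WORDS = {"the", "and", "a", "of", "in", "is"}


class PageParser(HTMLParser):
    """Collects hyperlink targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl_bfs(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: page text} in visit order."""
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # FIFO queue gives breadth-first order
        html = fetch(url)
        if html is None:               # page could not be fetched; skip it
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def stem(word):
    """Toy suffix stripper for illustration only, not a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(pages):
    """Maps each stemmed, non-stop-word term to the set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:
                continue
            index.setdefault(stem(token), set()).add(url)
    return index


pages = crawl_bfs("http://example.org/", PAGES.get)
index = build_inverted_index(pages)
```

In this sketch, looking up `index["crawl"]` returns the URLs of the pages containing "crawling" or "crawlers", since both reduce to the same stem. In a service-oriented design such as the one the abstract describes, the call to `build_inverted_index` would be replaced by a message sent from the crawler service to the indexer service.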

Authors

A. Vadivel
Multimedia Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamilnadu, India
S. G. Shaila
Multimedia Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamilnadu, India
R. Devi Mahalakshmi
Multimedia Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamilnadu, India
J. Karthika
Multimedia Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamilnadu, India
