
An Algorithm for Effective Web Crawling Mechanism of a Search Engine



Authors

B. Vijaya Babu
Department of IST, K.L. College of Engineering, Vaddeswaram-522 502, India
M. Surendra Prasad Babu
Department of CS&SE, College of Engineering, Andhra University, Visakhapatnam, India
Y. Chetan Prasad
Department of IST, K.L. College of Engineering, Vaddeswaram-522 502, India

Abstract


Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and operating-system limits must be taken into account in order to achieve high performance at a reasonable cost. Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links; they ignore search forms and pages that require authorization or prior registration, and in particular the tremendous amount of high-quality content “hidden” behind search forms in large searchable electronic databases. Moreover, even content that has been well indexed can be consulted only while the user is connected to the Internet. The days when hourly dial-up plans were the main route onto the Internet may be gone, and broadband and other high-speed connections are now available to the common man, but laptop usage is growing at the same pace, and a mobile user cannot always get online. Until Wi-Fi networks come into full swing, the important pages of a web site are of no use to such a user, however well configured his machine, and manually saving every page of a site is a tedious task. In this paper, we provide a framework for addressing the problem of browsing the web even when offline.
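The mechanism the abstract alludes to is a frontier of discovered-but-not-yet-fetched URLs driven by a crawl manager. The sketch below illustrates that loop for the offline-browsing use case; it is a minimal illustration, not the paper's implementation. It assumes Python with only the standard library, a breadth-first crawl confined to a single host, and a hypothetical ./mirror/ directory where fetched pages are stored for later offline reading.

    # Minimal sketch of a frontier-driven, single-site crawl that saves
    # pages for offline browsing. Assumptions (not from the paper):
    # Python standard library only; output directory "mirror" is hypothetical.
    import os
    import urllib.parse
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_site(seed, limit=100, out_dir="mirror"):
        """Breadth-first crawl of one host, saving each fetched page to disk."""
        host = urllib.parse.urlparse(seed).netloc
        frontier = deque([seed])   # frontier: URLs discovered but not yet fetched
        seen = {seed}
        while frontier and limit > 0:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue           # unreachable page: skip it, keep crawling
            limit -= 1
            # Save the page so the site can be browsed later without a connection.
            os.makedirs(out_dir, exist_ok=True)
            fname = urllib.parse.quote(url, safe="") + ".html"
            with open(os.path.join(out_dir, fname), "w", encoding="utf-8") as f:
                f.write(html)
            # Extract links and push unseen same-host URLs onto the frontier.
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href).split("#", 1)[0]
                parsed = urllib.parse.urlparse(absolute)
                if (parsed.scheme in ("http", "https")
                        and parsed.netloc == host
                        and absolute not in seen):
                    seen.add(absolute)
                    frontier.append(absolute)

    if __name__ == "__main__":
        crawl_site("http://example.com/")   # hypothetical seed URL

A production crawler of the kind the abstract describes would additionally honor robots.txt, rate-limit requests per host, and persist the frontier and the seen set, so that a crawl running for weeks or months can survive restarts.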

Keywords


Web Crawler, Search Engine, Indexer, Frontier, Crawl Manager.