An Algorithm for Effective Web Crawling Mechanism of a Search Engine
Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and operating-system limits must be taken into account in order to achieve high performance at a reasonable cost. Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content "hidden" behind search forms in large searchable electronic databases. Moreover, even when good content has been collected and indexed, those sites can be viewed only while connected to the Internet. The days when hourly dial-up access was the main route to the Internet are gone, and broadband, high-speed connections are now available to the common user; at the same time, laptop usage is growing just as quickly, and a user on the move may not have network access everywhere, so important pages on a web site are of no use to him even with a well-configured machine. Until Wi-Fi networks come into full swing, saving every page of a particular website by hand remains a tedious task. In this paper, we provide a framework for addressing the problem of browsing the web even when offline.
Keywords
Web Crawler, Search Engine, Indexer, Frontier, Crawl Manager.
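To make the mechanism concrete, below is a minimal sketch of the kind of crawl loop implied by the abstract and keywords: a frontier of unvisited URLs consumed by a crawl manager, with each fetched page saved to disk so the site can be browsed later while offline. The function and parameter names (crawl, save_dir, max_pages) and the single-threaded, same-host design are illustrative assumptions, not the framework actually proposed in the paper.

```python
# Illustrative sketch only: a breadth-first crawl that saves pages for
# offline browsing. Names and limits here are assumptions for clarity,
# not the paper's implementation.

import os
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, save_dir="offline_copy", max_pages=50):
    """Breadth-first crawl from seed_url, staying on the seed's host."""
    os.makedirs(save_dir, exist_ok=True)
    host = urlparse(seed_url).netloc
    frontier = deque([seed_url])   # URLs waiting to be fetched
    visited = set()                # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip unreachable or non-text pages

        # Persist the page so it can be browsed without a connection.
        filename = urlparse(url).path.strip("/").replace("/", "_") or "index"
        with open(os.path.join(save_dir, filename + ".html"), "w",
                  encoding="utf-8") as f:
            f.write(html)

        # Push newly discovered same-host links onto the frontier.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in visited:
                frontier.append(absolute)


if __name__ == "__main__":
    crawl("http://example.com/")
```

A real crawler of the kind discussed in the abstract would additionally respect robots.txt, throttle requests per host, and distribute the frontier across many worker processes; the sketch keeps only the frontier/fetch/save cycle that underlies offline browsing.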