
Crawling the Web at Desktop Balance Inspection





Authors

N. Prasanna Balaji
Dept. of IT, Gurunank Engineering College, NBA Accredited College, Ibrahimpatnam, Hyderabad, Ranga Reddy Dist., India
Pingili Madhavi
Department of Information Technology, GNEC (affiliated to JNTU), Hyderabad-501510, Andhra Pradesh, India

Abstract


A focused crawler is a hypertext resource discovery system whose goal is to selectively seek out pages that are relevant to a pre-defined set of topics. The inherent characteristics of focused crawling, personalization and low resource needs, naturally lend it to use by individuals. Current focused crawlers depend on a classifier that scores each crawled document with respect to the predefined set of topics.

Today, finding information on the web is an arduous task that will only get worse with the exponential increase in content. To deal with this issue, search engines, web sites that let users query for web pages of interest, have become the utility of choice. A simple search against any given engine can yield thousands of results, so a user could spend the majority of her time just paging through them. More often than not, she must visit the referenced page to see whether the result is actually what she was looking for. This can be attributed to two facts: the result's link metadata, i.e., the description of the page returned by the search engine, is not very descriptive, and the result page itself is not guaranteed to be what the user is looking for. Once a page is found that matches the user's expectations, the search engine moves out of the picture and it is up to the user to continue mining for the needed information, a cumbersome task given that pages on the web have a plethora of links and an abundance of content. Filtering through thousands of references is not a rewarding search process, since there is no way to be certain that the search will be complete or will always yield the web page the user was originally interested in. These problems must be addressed in order to increase the effectiveness of search engines.

We believe these problems could be addressed by reducing the number of links the user must visit to find the desired web page. First, we define a term used throughout the rest of the paper and in our assertion below: a concept is one overarching idea or topic present in a web page. Put simply, our assertion is that if the concepts underlying the web page results being searched can be presented to the user automatically, then the number of links the user needs to visit to find her desired result will be lessened. Essentially, this means automatically discovering the set of concepts that describe a web page. This paper describes and evaluates a method for automatically extracting such concepts from web pages returned by heterogeneous search engines, including Google, MSN Search, Yahoo! Search, AltaVista and Ask Jeeves. Along with regular concepts, our method also extracts complex concepts.
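The abstract names the two mechanisms involved, a classifier that scores each crawled page against the target topics, and an extractor that surfaces a page's concepts, but does not give their algorithms. The following is a minimal illustrative sketch only, assuming a simple token-overlap score in place of the paper's trained classifier and frequency-based term selection in place of its concept extractor; all function names, seed URLs, and the pruning threshold are hypothetical.

import heapq
import re
from collections import Counter
from urllib.parse import urljoin
from urllib.request import urlopen

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "that", "with"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def relevance(tokens, topic_tokens):
    # Jaccard-style overlap with a topic description; a real focused
    # crawler would use a trained classifier here (this is a stand-in).
    if not tokens:
        return 0.0
    return len(set(tokens) & set(topic_tokens)) / len(set(tokens) | set(topic_tokens))

def extract_concepts(tokens, k=5):
    # "Concepts" approximated as the most frequent content words; the
    # bigram counter gestures at the "complex concepts" the abstract
    # mentions, not at the paper's actual method.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    top = [w for w, _ in unigrams.most_common(k)]
    top += [" ".join(b) for b, _ in bigrams.most_common(k)]
    return top[:k]

def focused_crawl(seed_urls, topic, max_pages=50, threshold=0.05):
    topic_tokens = tokenize(topic)
    frontier = [(-1.0, u) for u in seed_urls]  # max-heap via negated score
    heapq.heapify(frontier)
    seen, results = set(seed_urls), {}
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        tokens = tokenize(text)
        score = relevance(tokens, topic_tokens)
        if score < threshold:
            continue  # prune off-topic pages instead of expanding their links
        results[url] = extract_concepts(tokens)
        for link in re.findall(r'href="([^"#]+)"', html):
            nxt = urljoin(url, link)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, nxt))  # child inherits parent's score
    return results

The priority-queue frontier, ordered by the relevance of the page a link was found on, approximates the best-first expansion that characterizes focused crawling: links from on-topic pages are followed before links from marginal ones, which is what keeps resource needs low enough for desktop-scale use.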

Keywords


Crawling, Crawlers, Focused Crawler, Query, Track, User Queries.