
Clustering of Deep Webpages: A Comparative Study


Affiliations
1 Department of CSE, NIT, Trichy, India
2 Department of CSE, NIT Trichy-620015, Tamilnadu, India
 

The internet holds a massive amount of information, stored in the form of an enormous number of webpages. The portion that search engines can retrieve is huge and constitutes the 'surface web'. The remaining information, which search engines do not index, forms the 'deep web'; it is much larger than the surface web and remains largely unexploited.

Several machine learning techniques have been commonly employed to access deep web content. Among these, topic models provide a simple way to analyze large volumes of unlabeled text. A 'topic' is a cluster of words that frequently occur together; topic models can connect words with similar meanings and distinguish between different senses of words with multiple meanings. In this paper, we cluster deep web databases using several methods and then perform a comparative study. In the first method, we apply Latent Semantic Analysis (LSA) to the dataset. In the second method, we use a generative probabilistic model, Latent Dirichlet Allocation (LDA), to model content representative of deep web databases. Both techniques are applied after preprocessing the set of web pages to extract page contents and form contents. Further, we propose a variant of LDA and apply it to the dataset. Experimental results show that the proposed method outperforms the existing clustering methods.
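The abstract does not describe how page contents, form contents, and the cosine similarity named in the keywords are combined, so the following is only a minimal sketch under stated assumptions: each page is reduced to a bag-of-words count vector built from its page text and form text, and two pages are compared by the cosine of the angle between their vectors. The function names and the toy pages are hypothetical, and raw term counts stand in for the LSA or LDA representations used in the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(page_content, form_content):
    """Build one bag-of-words count vector from both content sources.

    Assumption: page text and form text are simply concatenated; the
    paper may weight or model them differently.
    """
    return Counter(tokenize(page_content) + tokenize(form_content))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy pages, given as (page content, form content).
flights_a = term_vector("book cheap flights online", "from to departure date")
flights_b = term_vector("search flights and airfare", "departure date return date")
jobs = term_vector("find jobs near you", "keyword location salary")

sim_same = cosine_similarity(flights_a, flights_b)  # pages sharing terms
sim_diff = cosine_similarity(flights_a, jobs)       # pages with no shared terms
```

In this sketch the two flight-search pages share terms and so score higher than the flight and job pages, which share none. A topic-model pipeline would apply the same comparison to the lower-dimensional vectors produced by LSA or to the per-document topic distributions produced by LDA, rather than to raw counts.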


Keywords

Latent Dirichlet Allocation, Latent Semantic Analysis, Deep Web, Cosine Similarity, Form Content and Page Content.

Abstract Views: 291

PDF Views: 154





Authors

C. Muhunthaadithya
Department of CSE, NIT, Trichy, India
J. V. Rohit
Department of CSE, NIT, Trichy, India
Sadhana Kesavan
Department of CSE, NIT, Trichy, India
E. Sivasankar
Department of CSE, NIT Trichy-620015, Tamilnadu, India
