
A Comparative Study to Find a Suitable Method for Text Document Clustering


Affiliations
1 Department of Computer Science, P.S.G.R. Krishnammal College for Women, Coimbatore, India
2 Sri Ramakrishna College of Engineering, Coimbatore, India
 

Text mining is used in various text-related tasks such as information extraction, concept/entity extraction, document summarization, entity relation modeling (i.e., learning relations between named entities), categorization/classification and clustering. This paper focuses on document clustering, a field of text mining that groups a set of documents into meaningful categories. Its main aim is to present a performance analysis of techniques available for document clustering. The results of this comparative study can be used to improve existing text data mining frameworks and knowledge discovery. Seven clustering techniques are considered, in three groups: Group 1 - K-means and its variants (traditional K-means and K Means algorithms), Group 2 - Expectation Maximization and its variants (traditional EM, Spherical Gaussian EM algorithm and Linear Partitioning and Reallocation clustering (LPR) using EM algorithms), Group 3 - Semantic-based techniques (Hybrid method and Feature-based algorithms). The seven algorithms were selected based on their popularity in the text mining field. Several experiments were conducted to analyze the performance of the algorithms and to identify the best performer in terms of cluster purity, clustering accuracy and speed of clustering.
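The abstract names cluster purity as an evaluation criterion but does not define it. As a minimal sketch, assuming the commonly used definition (each cluster is credited with its most frequent true category, and the credited documents are divided by the total number of documents), purity could be computed as follows. The function name `cluster_purity` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return purity in [0, 1] under the common majority-class definition.

    Assumption: this is the standard textbook definition of purity; the
    paper's exact formula is not given in the abstract.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each true category occurs inside each cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster[cluster][label] += 1

    # Credit each cluster with the size of its majority category.
    majority_total = sum(
        max(counts.values()) for counts in counts_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
categories = ["sports", "sports", "politics",
              "politics", "politics", "sports"]
purity = cluster_purity(clusters, categories)
```

In this toy case each cluster holds two documents of its majority category, so four of the six documents are credited. Purity alone tends to favor solutions with many small clusters, which may be one reason to report it alongside accuracy and speed, as the study does.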

Keywords

Text Mining, Traditional K-Means, Traditional EM Algorithm, sGEM, HSTC Model, TCFS Method.

Authors

S. C. Punitha
Department of Computer Science, P.S.G.R. Krishnammal College for Women, Coimbatore, India
M. Punithavalli
Sri Ramakrishna College of Engineering, Coimbatore, India
