
Efficient and Effortless Similarity Measures for Cluster Ensembles


Affiliations
1 Dept. of CSE, Dr. MGR University, Chennai, India
2 Department of Information Science and Engineering, PESIT, Bangalore, India
     



Spatial data mining deals with the discovery of implicit knowledge in spatial data. With the tremendous rise in the accumulation of spatial data, new approaches to spatial data mining have become a critical requirement. The availability of many clustering algorithms and their derivatives, together with the success of bagging and boosting in classification, has brought cluster ensembles into the limelight over the last decade. Several ensemble techniques are available, including voting, graph-based, and information-theoretic approaches. In our work, we show that a guided approach to combining the outputs of the various clusterers reduces the intensive computation and also generates robust clusters. Cluster ensembles provide a tool for consolidating results from a portfolio of individual clusterings. The major challenge in fusing ensembles is the generation of the voting matrix or proximity matrix, which is of order n², where n is the number of data points. For spatial datasets this is very expensive in both time and space. Instead, our method computes a symmetric clusterer compatibility matrix of order (m×m), where m is the number of clusterers and m<<n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if fused first, will provide more information gain. This paper discusses the need for simple, elegant, yet effective similarity measures for cluster mining. Because the underlying data structure is already known in the case of cluster ensembles, we use that knowledge to find the similarity between the probable clusterer merge points. We use a set-theoretic approach and the Shannon partition entropy as the basis for our calculation of multiparty merge entropy.
The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated using various cluster validity metrics, such as accuracy, misclassification rate, Dunn indices, inter-cluster density, and intra-cluster density, measured on real-world datasets available in the University of California Irvine data repository.
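
To make the idea of an (m×m) clusterer compatibility matrix concrete, the following is a minimal Python sketch, not the authors' algorithm. The abstract does not define the similarity measure, so the sketch assumes a Jaccard overlap between clusters treated as sets of point indices, summed over best-matching clusters and averaged in both directions to keep the matrix symmetric. It also assumes that the pair with the highest score is the one to fuse first. All function names are hypothetical. The partition entropy function is the standard Shannon entropy of the cluster-size distribution.

```python
from collections import Counter
from itertools import combinations
from math import log2


def partition_entropy(labels):
    """Shannon entropy (in bits) of a partition given as one label per point."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def clusters_as_sets(labels):
    """Group point indices by cluster label, one set per cluster."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return list(groups.values())


def jaccard(a, b):
    """Set-overlap similarity; an assumed stand-in for the paper's measure."""
    return len(a & b) / len(a | b)


def cumulative_similarity(labels_a, labels_b):
    """Sum of best-match Jaccard scores, averaged over both directions."""
    ca, cb = clusters_as_sets(labels_a), clusters_as_sets(labels_b)
    a_to_b = sum(max(jaccard(x, y) for y in cb) for x in ca)
    b_to_a = sum(max(jaccard(y, x) for x in ca) for y in cb)
    return (a_to_b + b_to_a) / 2


def compatibility_matrix(partitions):
    """Symmetric m x m matrix for m clusterers; never builds an n x n matrix."""
    m = len(partitions)
    mat = [[0.0] * m for _ in range(m)]
    for i in range(m):
        mat[i][i] = cumulative_similarity(partitions[i], partitions[i])
    for i, j in combinations(range(m), 2):
        mat[i][j] = mat[j][i] = cumulative_similarity(partitions[i], partitions[j])
    return mat


def best_initial_pair(mat):
    """Assumed selection rule: pick the off-diagonal pair with the highest score."""
    return max(combinations(range(len(mat)), 2), key=lambda p: mat[p[0]][p[1]])


# Three toy clusterers over six points; the first two agree up to relabelling.
partitions = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 1, 0, 1, 0, 1],
]
mat = compatibility_matrix(partitions)
pair = best_initial_pair(mat)
```

In this toy input, clusterers 0 and 1 produce the same grouping under different labels, so they receive the highest compatibility score and are selected as the first pair. The point of the sketch is the cost profile: only m(m-1)/2 clusterer pairs are compared, rather than all pairs of the n data points.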

Keywords

Clustering Ensembles, Cluster Compatibility Matrix, Cluster Validity Metrics, Partition Entropy, Degree of Over Shadow.

Authors

R. J. Anandhi
Dept. of CSE, Dr. MGR University, Chennai, India
Natarajan Subramaniyam
Department of Information Science and Engineering, PESIT, Bangalore, India
