Efficient and Effortless Similarity Measures for Cluster Ensembles
Spatial data mining deals with the discovery of implicit knowledge in spatial data. With the tremendous rise in the accumulation of spatial data, new approaches to spatial data mining have become a critical requirement. The availability of many clustering algorithms and their derivatives, together with the success of bagging and boosting in classification, has brought cluster ensembles into the limelight over the last decade. Several ensemble techniques are available, including voting, graph-based, and information-theoretic approaches. In this work, we show that a guided approach to combining the outputs of the various clusterers reduces the intensive computation and also generates robust clusters. Cluster ensembles provide a tool for consolidating the results from a portfolio of individual clusterings. The major challenge in fusing ensembles is the generation of the voting matrix or proximity matrix, whose size is on the order of n², where n is the number of data points. For spatial datasets this is very expensive in both time and space. Instead, our method computes a symmetric clusterer compatibility matrix of order (m×m), where m is the number of clusterers and m<<n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if fused first, will provide the most information gain. This paper discusses the need for simple, elegant, yet effective similarity measures for cluster mining. Because the underlying data structure is already known in the case of cluster ensembles, we use that knowledge to find the similarity between the probable clusterer merge points. We use a set-theoretic approach and the Shannon partition entropy as the basis for calculating the multiparty merge entropy.
The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated using various cluster validity metrics, including accuracy, misclassification rate, Dunn indices, inter-cluster density, and intra-cluster density, measured on real-world datasets from the University of California, Irvine data repository.
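The idea of an (m×m) clusterer compatibility matrix can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so the code below assumes Jaccard overlap between clusters, summed over best-matching clusters in both directions, as a stand-in for the "cumulative similarity". The function names, the choice of Jaccard, and the rule of picking the highest-scoring pair for the first fusion are all illustrative assumptions, not the authors' method. The Shannon partition entropy is the standard one, computed from cluster size fractions.

```python
from collections import defaultdict
from math import log2


def clusters_of(labels):
    """Group point indices by cluster label: one set of indices per cluster."""
    groups = defaultdict(set)
    for idx, lab in enumerate(labels):
        groups[lab].add(idx)
    return list(groups.values())


def partition_entropy(labels):
    """Shannon entropy of a partition, using cluster size fractions."""
    n = len(labels)
    return -sum((len(c) / n) * log2(len(c) / n) for c in clusters_of(labels))


def jaccard(a, b):
    """Set overlap |a & b| / |a | b| (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def cumulative_similarity(labels_a, labels_b):
    """Assumed cumulative similarity: best-match Jaccard summed over the
    clusters of each clusterer, averaged over both directions so that the
    result is symmetric."""
    ca, cb = clusters_of(labels_a), clusters_of(labels_b)
    forward = sum(max(jaccard(x, y) for y in cb) for x in ca)
    backward = sum(max(jaccard(y, x) for x in ca) for y in cb)
    return (forward + backward) / 2


def compatibility_matrix(partitions):
    """Symmetric m x m matrix over clusterers, not n x n over data points."""
    m = len(partitions)
    mat = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            score = cumulative_similarity(partitions[i], partitions[j])
            mat[i][j] = mat[j][i] = score
    return mat


def most_compatible_pair(mat):
    """Pick the off-diagonal pair with the highest score (assumed rule for
    choosing which two clusterers to fuse first)."""
    m = len(mat)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    return max(pairs, key=lambda p: mat[p[0]][p[1]])


# Three hypothetical clusterers labelling the same four points.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with labels swapped
    [0, 1, 0, 1],
]
matrix = compatibility_matrix(partitions)
first_pair = most_compatible_pair(matrix)
```

With m clusterers, only m(m+1)/2 pairwise scores are computed, each over cluster sets rather than point pairs, which is where the saving over an n×n proximity matrix would come from when m<<n.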
Keywords
Clustering Ensembles, Cluster Compatibility Matrix, Cluster Validity Metrics, Partition Entropy, Degree of Over Shadow.