
A Modified K-Means Algorithm that Determines Number of Clusters Automatically


Affiliations
1 Banasthali University, Banasthali, Rajasthan, India
     



This paper presents a modified k-means algorithm that not only increases classification accuracy but also determines the optimal number of clusters automatically. It is a two-phase clustering algorithm based on k-means, membership degree, and standard deviation. Phase I determines the number of clusters automatically, using the standard deviation of the membership values. Phase II increases the classification accuracy. Unlike the most widely used unsupervised pattern clustering algorithms, such as Fuzzy C-means and K-means, it does not require the number of clusters as a parameter. Experiments were conducted on benchmark and synthetic datasets to evaluate the performance of the proposed approach.
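
The abstract names three ingredients (k-means, membership degree, and standard deviation) but does not state how they are combined. The following is a minimal Python sketch of those ingredients only, not the authors' algorithm. It assumes the FCM-style membership formula with fuzzifier m = 2 and Euclidean distance, and the function names `kmeans`, `memberships`, and `mean_membership_std` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids (tuples)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        new_centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def memberships(p, centroids, m=2.0):
    """FCM-style membership degrees of point p in each cluster; they sum to 1."""
    d = [math.dist(p, c) for c in centroids]
    if 0.0 in d:  # p coincides with one or more centroids
        hits = d.count(0.0)
        return [1.0 / hits if x == 0.0 else 0.0 for x in d]
    inv = [x ** (-2.0 / (m - 1.0)) for x in d]
    total = sum(inv)
    return [v / total for v in inv]


def mean_membership_std(points, centroids):
    """Mean, over all points, of the population std. dev. of that point's memberships."""
    stds = []
    for p in points:
        u = memberships(p, centroids)
        mu = sum(u) / len(u)
        stds.append(math.sqrt(sum((x - mu) ** 2 for x in u) / len(u)))
    return sum(stds) / len(stds)


if __name__ == "__main__":
    # Two well-separated toy groups (illustrative data, not from the paper).
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    for k in (2, 3, 4):
        print(k, mean_membership_std(data, kmeans(data, k)))
```

A point that belongs clearly to one cluster has one membership near 1 and the rest near 0, so the spread of its membership values is large; a point that sits between clusters has nearly equal memberships and a small spread. How the paper turns this statistic into a choice of cluster count is not given in the abstract, so the loop over candidate values of k here only prints the statistic and selects nothing.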

Keywords

Clustering, Validity Measures, K-Means, Membership Degree.




Authors

Jyoti Sharma
Banasthali University, Banasthali, Rajasthan, India
G. N. Purohit
Banasthali University, Banasthali, Rajasthan, India

