
Clustering Techniques for Biological Sequence Analysis: A Review


Affiliations
1 Department of Computer Science, Maharaja Ganga Singh University, Bikaner, Rajasthan, India
     



A variety of technical tools are now available for supporting and validating wet-lab experiments in science and biotechnology. Analyzing biological sequences requires grouping similar genes, which can be done with techniques such as pattern matching, classification, and clustering. The present study uses clustering as a tool for analyzing biological data. Clustering of biological sequences is an active area on which many researchers are working. However, simple clustering algorithms are not well suited to sequence analysis problems: most biological sequence analysis problems are NP-hard and require strong optimization algorithms.

This manuscript is a survey of clustering techniques useful for the analysis of biological sequences. A 3+ stage review process was adopted for the literature review. In preparing this report, 98 papers published from 1997 to 2014 were reviewed, organized by year of publication. The reviewed papers discuss various issues related to the analysis of biological sequences; the major issues identified include prediction, sequence alignment, motif discovery, and cluster boundary prediction. The solution approaches used by researchers for biological sequence analysis include evolutionary clustering, neural networks, hierarchical clustering, k-means, Go technologies, feature selection, incremental approaches, bio-inspired methods, particle swarm optimization, fuzzy techniques, rough set theory, and bi-clustering. Researchers have applied these approaches to various types of datasets. This communication also discusses these datasets, the parameters used, and the results reported in the papers.


Keywords

Biological Sequences, Sequence Analysis, Clustering, Sequence Clustering.


Authors

Jyoti Lakhani
Department of Computer Science, Maharaja Ganga Singh University, Bikaner, Rajasthan, India
Anupama Chowdhary
Department of Computer Science, Maharaja Ganga Singh University, Bikaner, Rajasthan, India
Dharmesh Harwani
Department of Computer Science, Maharaja Ganga Singh University, Bikaner, Rajasthan, India
