
Stacked Framework of Machine Learning Classifiers for Protein Family Prediction Using Protein Characteristics


Affiliations
1 Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Abhishekapatti, Tirunelveli 627 012, India
2 School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632 014, India
 

Identifying the family to which a protein belongs is a prerequisite for modifying and controlling that protein in applications such as the identification of drug-target interactions and structure prediction. Protein families are identified using the similarity between protein sequences. Alignment-free approaches use machine learning (ML) techniques for protein family prediction. In this study, two novel ML-based models for protein family prediction have been developed: a stacked framework of random forests, and a stacked framework of random forest, decision tree and naive Bayes classifiers. The two models outperform state-of-the-art methods, with accuracies of 98.21% and 98.49% respectively. The proposed models also give better results on twilight-zone protein datasets.
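As a rough illustration of the second model described above, the following is a minimal sketch of a stacked classifier that combines random forest, decision tree and naive Bayes base learners using scikit-learn. It is not the authors' implementation. Several pieces are assumptions made only for this example: the amino acid composition features (the abstract does not say which protein characteristics are used), the logistic regression meta-learner, the synthetic sequences produced by the hypothetical helper `toy_family`, and the family labels `fam_A`, `fam_B` and `fam_C`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Alignment-free features: fraction of each standard residue in seq."""
    seq = seq.upper()
    total = max(len(seq), 1)  # guard against empty input
    return [seq.count(a) / total for a in AMINO_ACIDS]


def toy_family(rng, favoured, n=60, length=80):
    """Hypothetical data: random sequences biased toward a few residues."""
    weights = np.ones(len(AMINO_ACIDS))
    for a in favoured:
        weights[AMINO_ACIDS.index(a)] = 6.0
    weights /= weights.sum()
    letters = list(AMINO_ACIDS)
    return ["".join(rng.choice(letters, size=length, p=weights)) for _ in range(n)]


# Build a synthetic three-family dataset (assumption: not real protein data).
rng = np.random.default_rng(0)
families = {"fam_A": "AL", "fam_B": "KR", "fam_C": "DE"}
X, y = [], []
for label, favoured in families.items():
    for seq in toy_family(rng, favoured):
        X.append(composition(seq))
        y.append(label)
X = np.array(X)
y = np.array(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners named in the abstract; the meta-learner is an assumption.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions are used to train the meta-learner
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on toy data: {accuracy:.3f}")
```

The accuracy printed here reflects only the easily separable toy data and says nothing about the 98.21% and 98.49% figures reported in the abstract, which were obtained on the authors' own datasets and features.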

Keywords

Alignment Free Method, Machine Learning, Protein Family Prediction, Stacked Framework, Twilight-Zone Proteins.



Authors

T. Idhaya
Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Abhishekapatti, Tirunelveli 627 012, India
A. Suruliandi
Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Abhishekapatti, Tirunelveli 627 012, India
S. P. Raja
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632 014, India


DOI: https://doi.org/10.18520/cs%2Fv125%2Fi5%2F508-517