Stacked Framework of Machine Learning Classifiers for Protein Family Prediction Using Protein Characteristics
Identifying the family to which a protein belongs is a prerequisite for modifying and controlling the protein in applications such as the identification of drug-target interactions and structure prediction. Protein families are identified from the similarity between protein sequences. Alignment-free approaches use machine learning (ML) techniques for protein family prediction. In this study, two novel ML-based models for protein family prediction have been developed: a stacked framework of random forest, and a stacked framework of random forest, decision tree and naive Bayes. The two models outperform state-of-the-art methods, with accuracies of 98.21% and 98.49% respectively. The proposed models also give better results on twilight-zone protein datasets.
Keywords
Alignment-Free Method, Machine Learning, Protein Family Prediction, Stacked Framework, Twilight-Zone Proteins.
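The stacked framework of random forest, decision tree and naive Bayes named in the abstract can be illustrated with a minimal sketch using scikit-learn's `StackingClassifier`. This is not the authors' implementation: the synthetic feature matrix stands in for per-protein characteristics, and the logistic-regression meta-learner, the hyperparameters and the 5-fold setting are all assumptions made for illustration, since the abstract does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for per-protein feature vectors (X)
# and protein family labels (y); the real features are not given in the abstract.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Level-0 (base) learners: the three classifier types named in the abstract.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
]

# Level-1 (meta) learner: logistic regression is an assumption, chosen only
# because it is a common default for stacking.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions are used to train the meta-learner
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the 98.21% and 98.49% figures reported for the proposed models. Training the meta-learner on out-of-fold predictions (the `cv` argument) is what keeps the second level from simply memorising the base learners' fit to the training set.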