Authors :
Baratam Sai Lahari; Balasani Manasini; Rachapudi Reshma; Sri. Ch. Ratna Babu
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/yc5wckn2
Scribd :
https://tinyurl.com/y7nae92t
DOI :
https://doi.org/10.38124/ijisrt/26apr1516
Abstract :
This study develops a machine learning approach for identifying cancer-associated mutations in the POLB gene
from features of single nucleotide polymorphisms (SNPs). A dataset of bioinformatics-derived features, including
SIFT, PolyPhen2, CADD, and REVEL scores, was pre-processed and used to build predictive models. Five classification
algorithms were trained and evaluated: Logistic Regression, Random Forest, Support Vector Machine, Multilayer
Perceptron, and XGBoost. To obtain reliable performance estimates, bootstrap resampling was used, and accuracy,
precision, recall, F1 score, and specificity were calculated. The two ensemble models (Random Forest and XGBoost)
achieved the highest accuracy, approximately 83 percent, suggesting that they can capture complex relationships in
SNP data. SHAP explanation methods were then used to interpret model predictions and identify the features with the
largest effect on classification decisions. The results indicate that machine learning techniques are applicable to
genomic research, particularly for predicting cancer-associated mutations.
Keywords :
POLB, Single Nucleotide Polymorphism, Machine Learning, Cancer Mutation Prediction, Random Forest, XGBoost, SHAP Explainability.
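The evaluation step named in the abstract (accuracy, precision, recall, F1 score, and specificity, estimated with bootstrap resampling) can be illustrated with a minimal sketch. The abstract does not specify the exact resampling protocol, so this example assumes the simplest variant: resampling held-out predictions with replacement and reporting a percentile interval. The function names, the toy labels, and the choice of 1000 resamples are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = cancer-associated."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (e.g. no positives drawn)."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    """Compute the five metrics listed in the abstract from one set of predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


def bootstrap_metrics(y_true, y_pred, n_boot=1000, seed=0):
    """Resample (label, prediction) pairs with replacement and summarise each
    metric as (mean, 2.5th percentile, 97.5th percentile). Assumed protocol."""
    rng = random.Random(seed)
    n = len(y_true)
    samples = {name: [] for name in metrics(y_true, y_pred)}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name, value in resampled.items():
            samples[name].append(value)
    summary = {}
    for name, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (n_boot - 1))]
        upper = values[int(0.975 * (n_boot - 1))]
        summary[name] = (mean(values), lower, upper)
    return summary


# Toy labels for illustration only; these are not data from the study.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

point = metrics(y_true, y_pred)
summary = bootstrap_metrics(y_true, y_pred)
for name, (avg, lower, upper) in summary.items():
    print(f"{name}: point={point[name]:.3f} mean={avg:.3f} interval=({lower:.3f}, {upper:.3f})")
```

In practice, `y_pred` would come from one of the five fitted classifiers, and the same routine would be run for each model so that their intervals can be compared side by side.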