Authors :
Dhanashree Kulkarni; Nikita Vikrant Chavan; Dr. Manisha Bharati
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/bde3namb
Scribd :
https://tinyurl.com/39dh9292
DOI :
https://doi.org/10.38124/ijisrt/26apr2270
Abstract :
Diabetes is a chronic, progressive metabolic disorder that affects millions of people worldwide, making it one of
the most significant health problems of the modern era. Timely and accurate prediction of diabetes risk is essential
for enabling early intervention and reducing the impact of potential complications. This work presents GlucoSense AI, an
end-to-end machine learning pipeline for predicting the development of diabetes, evaluated on two publicly available
datasets: the UCI Early-Stage Diabetes Risk Prediction Dataset (n=520, d=16) and the BRFSS Diabetes Binary Health
Indicators Data Set (n=253,680, d=21). The proposed approach addresses several key challenges. First,
hybrid SMOTETomek resampling mitigates class imbalance by combining synthetic minority oversampling with
Tomek link removal. Second, Recursive Feature Elimination with Cross-Validation (RFECV) is used to
select the optimal feature subset. Third, Optuna's Bayesian optimization tunes the hyperparameters of three
gradient-boosting algorithms, LightGBM, XGBoost, and CatBoost, each trained over 100 iterations. Fourth, the tuned
models are combined into a stacking ensemble with Logistic Regression as the meta-learner. Platt sigmoid
calibration is then applied so that the ensemble returns reliable probability scores. SHAP (SHapley Additive exPlanations)
provides insight into the models' decisions at both the global level and the level of individual instances.
GlucoSense AI Pro is a production-ready application implemented as a Streamlit web app with user
authentication. On the UCI dataset, CatBoost achieves the best ROC-AUC of 0.9988, while the calibrated stacking
ensemble reaches 0.9977. On BRFSS, CatBoost again leads with an AUC of 0.8150, while the calibrated ensemble
reaches 0.8026.
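To make the calibration and evaluation steps named in the abstract more concrete, the following is a minimal pure-Python sketch of Platt sigmoid calibration and of ROC-AUC computed as a pairwise ranking statistic. It is an illustration under stated assumptions, not the GlucoSense AI implementation: the function names (`fit_platt`, `roc_auc`, `log_loss`), the plain gradient-descent fit, the learning rate, and the toy scores and labels are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss.

    Assumption: plain gradient descent is used here for brevity; library
    implementations typically use a more careful optimizer.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def log_loss(probs, labels, eps=1e-12):
    # Mean negative log-likelihood, with probabilities clipped away from 0 and 1.
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def roc_auc(scores, labels):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical raw ensemble scores and true labels (toy data, not from the paper).
raw_scores = [-3.0, -2.0, -1.5, -0.5, 0.2, 0.4, 1.0, 1.8, 2.5, 3.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(raw_scores, labels)
calibrated = [sigmoid(a * s + b) for s in raw_scores]

print("log loss before:", log_loss([sigmoid(s) for s in raw_scores], labels))
print("log loss after: ", log_loss(calibrated, labels))
print("AUC before/after:", roc_auc(raw_scores, labels), roc_auc(calibrated, labels))
```

In this sketch the fitted sigmoid has a positive slope, so it preserves the ordering of the scores: calibration changes the probability values (and hence the log loss) but not the ROC-AUC of this single score vector. A calibrated ensemble can still differ in AUC from an individual base model, since the two rank cases differently.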
Keywords :
Diabetes Prediction, Ensemble Learning, Stacking Classifier, LightGBM, XGBoost, CatBoost, Bayesian Optimisation, Optuna, SHAP Explainability, SMOTETomek, RFECV, Streamlit Deployment.