Optimizing Heart Disease Diagnosis: Feature Selection Techniques for Enhanced Machine Learning Model Performance


Authors : Ravinder Kaur; Sonia Rani; Chitra Desai; Sagar Jambhorkar

Volume/Issue : Volume 9 - 2024, Issue 9 - September


Google Scholar : https://tinyurl.com/yn4ssbfw

Scribd : https://tinyurl.com/bde9268u

DOI : https://doi.org/10.38124/ijisrt/IJISRT24SEP1684



Abstract : Heart disease is a growing global concern that affects people across age groups and genders. Detecting heart failure early is crucial, and ongoing research draws on advances in healthcare technology, machine learning, imaging techniques, and data science to analyze large datasets for this purpose. However, not all data attributes contribute equally to diagnosing heart disease: irrelevant features increase resource demands and can lead to inaccurate predictions, which in a clinical setting may have fatal consequences. This study focuses on feature extraction and reduction techniques that identify the most critical attributes for heart disease diagnosis, balancing resource efficiency against diagnostic accuracy. Using a dataset from the UCI repository that includes both continuous and categorical features, we standardize the data and split it into training and testing sets in an 80:20 ratio. We then apply feature selection techniques to five machine learning models: K-nearest neighbor, decision tree classifier, SVM, logistic regression, and random forest. The models' predictive performance is evaluated using confusion matrices and ROC curves, which show the impact of feature selection on diagnostic accuracy.
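
The workflow described in the abstract (standardize, split 80:20, select features, fit a classifier, inspect a confusion matrix) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins rather than the UCI heart disease records, the abstract does not name a specific selection technique so a simple correlation filter is used here, and only K-nearest neighbor of the five listed models is shown. All function names (`make_synthetic`, `split_80_20`, `select_top_k`, and so on) are hypothetical, and the code uses only the Python standard library so that it runs without extra dependencies. The scaling statistics are computed on the training split alone, which is a common precaution against leakage and may differ from the order used in the paper.

```python
import math
import random


def make_synthetic(n=200, seed=0):
    """Hypothetical stand-in data: features 0 and 1 carry signal, 2 and 3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([
            rng.gauss(2.0 * label, 1.0),   # informative
            rng.gauss(-1.5 * label, 1.0),  # informative
            rng.gauss(0.0, 1.0),           # noise
            rng.gauss(0.0, 1.0),           # noise
        ])
        y.append(label)
    return X, y


def take(rows, ids):
    """Return the rows at the given indices."""
    return [rows[i] for i in ids]


def split_80_20(X, y, seed=0):
    """Shuffle indices, then hold out 20% of the rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * 0.2)
    test, train = idx[:n_test], idx[n_test:]
    return take(X, train), take(y, train), take(X, test), take(y, test)


def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit variance using training statistics."""
    cols = list(zip(*X_train))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(X_train), scale(X_test)


def select_top_k(X, y, k):
    """Filter-style selection: rank features by |Pearson correlation| with the label."""
    n = len(y)
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(d * d for d in y_dev))
    scores = []
    for j, col in enumerate(zip(*X)):
        m = sum(col) / n
        dev = [v - m for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        corr = sum(a * b for a, b in zip(dev, y_dev)) / (norm * y_norm)
        scores.append((abs(corr), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def knn_predict(X_train, y_train, row, k=5):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted((math.dist(row, r), label)
                     for r, label in zip(X_train, y_train))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def run_demo():
    X, y = make_synthetic()
    X_tr, y_tr, X_te, y_te = split_80_20(X, y)
    X_tr, X_te = standardize(X_tr, X_te)
    keep = select_top_k(X_tr, y_tr, k=2)
    X_tr_sel = [[r[j] for j in keep] for r in X_tr]
    X_te_sel = [[r[j] for j in keep] for r in X_te]
    preds = [knn_predict(X_tr_sel, y_tr, r) for r in X_te_sel]
    return keep, confusion_matrix(y_te, preds)


keep, cm = run_demo()
print("selected feature indices:", keep)
print("confusion matrix [[TN, FP], [FN, TP]]:", cm)
```

In practice a library such as scikit-learn would replace most of these helpers, and the same reduced feature set would be passed to each of the five classifiers so that their confusion matrices and ROC curves can be compared with and without selection.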

