Authors :
Ravinder Kaur; Sonia Rani; Chitra Desai; Sagar Jambhorkar
Volume/Issue :
Volume 9 - 2024, Issue 9 - September
Google Scholar :
https://tinyurl.com/yn4ssbfw
Scribd :
https://tinyurl.com/bde9268u
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24SEP1684
Abstract :
Heart disease is a growing global health concern that affects people across age groups and genders. Early detection of heart failure is crucial, and ongoing research draws on advances in healthcare technology, machine learning, imaging, and data science to analyze large datasets for this purpose. However, not all data attributes contribute equally to diagnosing heart disease: irrelevant features increase resource demands and can lead to inaccurate predictions, with potentially fatal consequences. This study focuses on feature extraction and reduction techniques that identify the most critical attributes for heart disease diagnosis, balancing resource efficiency against diagnostic accuracy. Using a dataset from the UCI repository that contains both continuous and categorical features, we standardize the data and split it into training and testing sets in an 80:20 ratio. We then apply feature selection techniques and train machine learning models including k-nearest neighbors, a decision tree classifier, a support vector machine (SVM), logistic regression, and random forest. The models' predictive performance is evaluated using confusion matrices and ROC curves, which show the impact of feature selection on diagnostic accuracy.
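The workflow summarized in the abstract (standardize, 80:20 split, feature selection, five classifiers, confusion matrices and ROC analysis) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. Several pieces are assumptions: the data is a synthetic stand-in generated with scikit-learn's `make_classification` rather than the UCI heart disease dataset, the selection method is univariate `SelectKBest` with `f_classif` (the abstract does not name the technique used), `k=6` and all hyperparameters are arbitrary, and the helper name `evaluate` is hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary-classification data standing in for the
# UCI heart disease dataset; the shape (300 x 13) is arbitrary.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_redundant=2, random_state=0)

# 80:20 train/test split, as stated in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The five model families named in the abstract (default-ish settings).
models = {
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

def evaluate(k):
    """Fit each model on the top-k features; return test-set metrics."""
    results = {}
    for name, model in models.items():
        # Scaler and selector are fitted on the training split only.
        pipe = make_pipeline(StandardScaler(),
                             SelectKBest(f_classif, k=k),
                             clone(model))
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        y_score = pipe.predict_proba(X_test)[:, 1]
        results[name] = {
            "confusion_matrix": confusion_matrix(y_test, y_pred),
            "roc_auc": roc_auc_score(y_test, y_score),
        }
    return results

full = evaluate("all")   # baseline: every feature kept
reduced = evaluate(6)    # assumption: keep the 6 highest-scoring features

for name in models:
    print(f"{name}: AUC all={full[name]['roc_auc']:.3f} "
          f"top-6={reduced[name]['roc_auc']:.3f}")
```

One design choice differs from the order given in the abstract: the sketch places the scaler and selector inside a pipeline so that they are fitted on the training split only, which keeps test-set statistics out of preprocessing. Comparing `full` and `reduced` per model is one simple way to observe the effect of feature selection on the confusion matrix and ROC AUC.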