Using Mahout Library for Clustering Algorithm: A Case Study on Healthcare Data


Authors : Dr. Divya Chauhan; Dr. Satpal

Volume/Issue : Volume 10 - 2025, Issue 11 - November


Google Scholar : https://tinyurl.com/4y9pz4nd

Scribd : https://tinyurl.com/58brdnm6

DOI : https://doi.org/10.38124/ijisrt/25nov1348



Abstract : Data mining techniques and algorithms work well on small datasets, where they analyse bulk data to identify trends and draw conclusions. Most data mining tools, however, cannot efficiently process the very large datasets that characterise big data. They cannot deliver results quickly unless the computational tasks are run on multiple machines distributed over the cloud. To process large volumes of data such as big data, Hadoop has adopted a machine learning library called Mahout. This paper applies clustering algorithms from the Mahout library in a Hadoop MapReduce environment to a large real-world healthcare dataset. The three clustering algorithms used are canopy clustering, K-Means clustering and fuzzy K-Means clustering.

Keywords : Data Mining, Big Data, Hadoop, Mahout, Clustering, Healthcare.
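
The three algorithms named in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not Mahout's API and not the authors' implementation: Mahout runs these algorithms as distributed MapReduce jobs, whereas the code below only shows the underlying logic. The function names, the thresholds t1 and t2, the fuzziness value m, and the toy points are all assumptions chosen for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def canopy_centers(points, t1, t2):
    """Canopy clustering with loose threshold t1 and tight threshold t2 (t1 > t2).
    Returns (center, members) pairs; members lie within t1 of the center."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [p for p in points if dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if dist(p, center) > t2]
    return canopies

def k_means(points, centroids, iterations=10):
    """Hard assignment: each point belongs to its nearest centroid."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fuzzy_memberships(p, centroids, m=2.0):
    """Soft assignment: degrees of membership of p in each cluster, summing to 1."""
    d = [dist(p, c) for c in centroids]
    if 0.0 in d:  # p coincides with one or more centroids
        zeros = d.count(0.0)
        return [1.0 / zeros if x == 0.0 else 0.0 for x in d]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in d) for dj in d]

def fuzzy_k_means(points, centroids, m=2.0, iterations=10):
    """Each centroid is the mean of all points weighted by membership ** m."""
    dims = len(points[0])
    for _ in range(iterations):
        u = [fuzzy_memberships(p, centroids, m) for p in points]
        updated = []
        for j in range(len(centroids)):
            w = [row[j] ** m for row in u]
            total = sum(w)
            updated.append(tuple(
                sum(wi * p[k] for wi, p in zip(w, points)) / total
                for k in range(dims)))
        centroids = updated
    return centroids

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

canopies = canopy_centers(points, t1=3.0, t2=1.5)
centers = [center for center, _ in canopies]
km = k_means(points, centers)
fz = fuzzy_k_means(points, centers)
print(len(centers), km, fz)
```

In this sketch the canopy pass supplies both the number of clusters and the starting centroids for the other two algorithms, which is a common way to combine them. K-Means assigns each point to exactly one cluster, while fuzzy K-Means gives every point a degree of membership in every cluster.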


