Authors :
Dr. Divya Chauhan; Dr. Satpal
Volume/Issue :
Volume 10 - 2025, Issue 11 - November
Google Scholar :
https://tinyurl.com/4y9pz4nd
Scribd :
https://tinyurl.com/58brdnm6
DOI :
https://doi.org/10.38124/ijisrt/25nov1348
Abstract :
Data mining techniques and algorithms work well on small datasets, where they analyse bulk data
to identify trends and draw conclusions. However, most data mining tools cannot efficiently process the very large
datasets typical of big data. They cannot deliver results quickly unless the computational
tasks are distributed across multiple machines in the cloud. To process large volumes of data, Hadoop has
adopted a machine learning library called Mahout.
This paper examines clustering algorithms using the Mahout library in a Hadoop MapReduce environment.
A large real-world healthcare dataset is used. The three clustering algorithms used are canopy
clustering, K-Means clustering and fuzzy K-Means clustering.
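To make the three algorithms concrete, the following is a minimal single-machine sketch in plain Python. It is not the Mahout or MapReduce implementation used in the paper. The sample points, the tight threshold `t2`, the fuzziness value `m`, and all function names are assumptions chosen for illustration, and the canopy step is simplified to use only the tight threshold (the loose threshold T1 is omitted). It shows canopy clustering producing seed centers, K-Means refining them with hard assignments, and fuzzy K-Means computing soft membership degrees.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Simplified canopy pass: each picked point becomes a center and
    every remaining point within the tight threshold t2 is discarded."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def kmeans(points, centers, iterations=10):
    """Hard K-Means: assign each point to its nearest center, then move
    each center to the mean of its members. The returned clusters are
    the assignments made in the last iteration."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
    return centers, clusters


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy K-Means membership degrees of one point; they sum to 1."""
    d = [dist(point, c) for c in centers]
    zeros = [x == 0 for x in d]
    if any(zeros):
        # The point coincides with one or more centers: share membership among them.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    inv = [x ** (-2.0 / (m - 1.0)) for x in d]
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical two-dimensional sample data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
seeds = canopy_centers(points, t2=3.0)
centers, clusters = kmeans(points, seeds)
u = fuzzy_memberships(points[0], centers)
```

Using canopy output to seed K-Means is a common pairing because the canopy pass is cheap and removes the need to guess the number of clusters up front. The paper's own pipeline and parameter choices may differ.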
Keywords :
Data Mining, Big Data, Hadoop, Mahout, Clustering, Healthcare.