Authors :
T. Madhu; M. Mallikarjun; P. Charan Teja; K. Rahitya
Volume/Issue :
Volume 10 - 2025, Issue 5 - May
Google Scholar :
https://tinyurl.com/3uv46m67
DOI :
https://doi.org/10.38124/ijisrt/25may1946
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
This project presents a semantic document clustering system that groups documents by the similarity of their content. Unlike traditional keyword-based methods, which rely solely on word frequency, the system leverages Natural Language Processing (NLP) to capture and compare the semantic meaning of documents. Using pre-trained language models such as BERT and Sentence-BERT, each document is converted into a dense vector representation that encodes its underlying meaning. These vectors enable precise comparison of documents’ semantic content and therefore more accurate clustering. The system groups the resulting embeddings with clustering algorithms such as K-Means and DBSCAN, using cosine similarity as the measure of relatedness so that semantically related documents fall into the same cluster. Experimental results demonstrate that this approach produces more coherent and contextually relevant clusters than traditional techniques, making it an effective solution for content organization, topic analysis, and information retrieval.
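The embedding-then-cluster pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy vectors below stand in for real Sentence-BERT document embeddings, and it uses the common trick that K-Means on L2-normalized vectors ranks pairs identically to cosine similarity, since ||u − v||² = 2 − 2·cos(u, v) on unit vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Toy "document embeddings": two semantic groups around distinct centers,
# standing in for dense vectors produced by a model such as Sentence-BERT.
center_a = rng.normal(size=8)
center_b = rng.normal(size=8)
docs = np.vstack([
    center_a + 0.05 * rng.normal(size=(5, 8)),  # documents on topic A
    center_b + 0.05 * rng.normal(size=(5, 8)),  # documents on topic B
])

# L2-normalize so that Euclidean K-Means on the unit vectors agrees
# with a cosine-similarity notion of relatedness.
unit = normalize(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
```

In a real run, `docs` would come from a call like `SentenceTransformer(...).encode(texts)`; the clustering step is unchanged.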
Keywords :
Semantic Document Clustering, NLP, BERT Embeddings, Sentence-BERT, Document Similarity, Content-Based Clustering, Cosine Similarity, K-Means, DBSCAN, Vector Representation, Topic Analysis, Information Retrieval, Dense Vector Embeddings, Pre-trained Language Models, Contextual Clustering.
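For the density-based alternative mentioned in the abstract, DBSCAN can operate directly on cosine distance (1 − cosine similarity), which also lets it flag outlier documents that belong to no topic. The sketch below again uses toy vectors in place of real embeddings, and the `eps` and `min_samples` values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Two tight "topic" groups plus one unrelated outlier document.
c1 = rng.normal(size=16)
c2 = rng.normal(size=16)
emb = np.vstack([
    c1 + 0.02 * rng.normal(size=(6, 16)),
    c2 + 0.02 * rng.normal(size=(6, 16)),
    rng.normal(size=(1, 16)),  # outlier: unrelated to either topic
])

# DBSCAN with cosine distance; points in no dense region get label -1.
labels = DBSCAN(eps=0.1, min_samples=3, metric="cosine").fit_predict(emb)
```

Unlike K-Means, this requires no preset cluster count, which suits corpora where the number of topics is unknown.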
References :
- REIMER M., DODGE J., GILMER J., HOFFMAN M.D., DREDZE M. Sentence-level representations for document classification. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, USA, 2019, pp. 700–707, doi: 10.18653/v1/N19-1070.
- DEVLIN J., CHANG M.W., LEE K., TOUTANOVA K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, pp. 4171–4186, doi: 10.48550/arXiv.1810.04805.
- REIMERS N., GUREVYCH I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, 2019, pp. 3982–3992, doi: 10.48550/arXiv.1908.10084.
- AGGARWAL C.C., ZHAI C.X. A Survey of Text Clustering Algorithms. Mining Text Data, Springer, Boston, MA, 2012, pp. 77–128, doi: 10.1007/978-1-4614-3223-4_4.
- ESTER M., KRIEGEL H.P., SANDER J., XU X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, 1996, pp. 226–231.
- HAN J., PEI J., KAMBER M. Data Mining: Concepts and Techniques. Elsevier, 4th Edition, 2022, pp. 493–508, ISBN: 978-0-12-818148-7.
- LIU J., SHEN X., PAN W., LIU B. Document clustering via topic modeling using BERT embeddings. Proceedings of the 2020 International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Beijing, 2020, pp. 188–192, doi: 10.1109/ICAICE51518.2020.00047.
- ZHAO W.X., GUO Y., HE Y. A Comparative Study of Deep Learning Models for Semantic Document Clustering. Information Sciences, 2021, 576, pp. 55–72, doi: 10.1016/j.ins.2021.07.055.