Semantic Document Clustering Using NLP


Authors : T. Madhu; M. Mallikarjun; P. Charan Teja; K. Rahitya

Volume/Issue : Volume 10 - 2025, Issue 5 - May


Google Scholar : https://tinyurl.com/3uv46m67

DOI : https://doi.org/10.38124/ijisrt/25may1946



Abstract : This project explores a semantic-based document clustering system designed to group documents based on the similarity of their content. Unlike traditional keyword-based methods, which rely solely on word frequency, this system leverages Natural Language Processing (NLP) to understand and compare the semantic meaning within documents. Using pre-trained language models such as BERT and Sentence-BERT, each document is converted into a dense vector representation that captures its underlying meaning. These vectors enable precise comparison of documents’ semantic content, allowing for more accurate clustering. The project employs clustering algorithms such as K-Means and DBSCAN, which group documents into clusters based on similarity. Cosine similarity further ensures that related documents are accurately clustered together. Experimental results demonstrate that this approach produces more coherent and contextually relevant clusters compared to traditional techniques, making it an effective solution for applications in content organization, topic analysis, and information retrieval.
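The pipeline the abstract describes (dense document embeddings, L2-normalisation so that K-Means distance tracks cosine similarity, then clustering) can be sketched as below. This is a minimal illustration, not the authors' implementation: the random vectors stand in for real Sentence-BERT embeddings, which in practice would come from a call such as `SentenceTransformer("all-MiniLM-L6-v2").encode(documents)` from the `sentence-transformers` library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder embeddings standing in for Sentence-BERT output (384-dim,
# matching models like all-MiniLM-L6-v2). Two synthetic "topics" are
# simulated by shifting the mean of the random vectors.
rng = np.random.default_rng(42)
topic_a = rng.normal(loc=1.0, size=(5, 384))   # documents on one topic
topic_b = rng.normal(loc=-1.0, size=(5, 384))  # documents on another topic
embeddings = np.vstack([topic_a, topic_b])

# L2-normalise so that Euclidean distance (which K-Means minimises)
# is monotonically related to cosine similarity.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Group the documents into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normed)
print(labels)  # documents from the same synthetic topic share a label

# Within-topic cosine similarity should exceed cross-topic similarity.
sim = cosine_similarity(normed)
print(sim[0, 1] > sim[0, 5])
```

For the density-based alternative mentioned in the abstract, `sklearn.cluster.DBSCAN(metric="cosine")` can be substituted for K-Means; it does not require fixing the number of clusters in advance and marks outlier documents as noise.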

Keywords : Semantic Document Clustering, NLP, BERT Embeddings, Sentence-BERT, Document Similarity, Content-Based Clustering, Cosine Similarity, K-Means, DBSCAN, Vector Representation, Topic Analysis, Information Retrieval, Dense Vector Embeddings, Pre-trained Language Models, Contextual Clustering.

References :

  1. REIMER M., DODGE J., GILMER J., HOFFMAN M.D., DREDZE M. Sentence-level representations for document classification. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, USA, 2019, pp. 700–707, doi: 10.18653/v1/N19-1070.
  2. DEVLIN J., CHANG M.W., LEE K., TOUTANOVA K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, pp. 4171–4186, doi: 10.48550/arXiv.1810.04805.
  3. REIMERS N., GUREVYCH I. Sentence-BERT: Sentence Embeddings using Siamese BERT- Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, 2019, pp. 3982–3992, doi: 10.48550/arXiv.1908.10084.
  4. AGGARWAL C.C., ZHAI C.X. A Survey of Text Clustering Algorithms. Mining Text Data, Springer, Boston, MA, 2012, pp. 77–128, doi: 10.1007/978-1-4614-3223-4_4.
  5. ESTER M., KRIEGEL H.P., SANDER J., XU X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, 1996, pp. 226–231.
  6. HAN J., PEI J., TONG H. Data Mining: Concepts and Techniques. Elsevier, 4th Edition, 2022, pp. 493–508, ISBN: 978-0-12-818148-7.
  7. LIU J., SHEN X., PAN W., LIU B. Document clustering via topic modeling using BERT embeddings. Proceedings of the 2020 International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Beijing, 2020, pp. 188–192, doi: 10.1109/ICAICE51518.2020.00047.
  8. ZHAO W.X., GUO Y., HE Y. A Comparative Study of Deep Learning Models for Semantic Document Clustering. Information Sciences, 2021, 576, pp. 55–72, doi: 10.1016/j.ins.2021.07.055.

