Authors :
Ayan Hussain; Moh Zaid Khan; Rayyan Arif Hussain; Abdul Ahad; Ambreen Anees
Volume/Issue :
Volume 10 - 2025, Issue 4 - April
Google Scholar :
https://tinyurl.com/bx6zsn36
DOI :
https://doi.org/10.38124/ijisrt/25apr1961
Abstract :
Document similarity plays a pivotal role in Natural Language Processing (NLP), especially in tasks
that require measuring the degree of relatedness between texts. This paper presents a study and
implementation of document similarity techniques in the Java programming language, with a focus on practical NLP
approaches. The work is motivated by real-world applications such as plagiarism detection, content
recommendation systems, semantic search engines, and automated document classification. The system developed in this
research employs a multi-step NLP pipeline that begins with data preprocessing. This includes standard procedures such as
text normalization, tokenization, stop word removal, and optional stemming or lemmatization. After preprocessing,
documents are converted into numerical vectors using the Term Frequency–Inverse Document Frequency (TF-IDF)
weighting scheme, which measures how important each term is to a document relative to the collection as a
whole. Because cosine similarity is effective at comparing text-based vectors in a high-dimensional space, it is used to evaluate
the similarity between document vectors.
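The pipeline described in the abstract (preprocessing, TF-IDF weighting, cosine similarity) can be illustrated with a minimal Java sketch. This is not the authors' implementation. The class name `DocumentSimilaritySketch`, its method names, the tiny stop-word list, and the particular TF-IDF variant (term count divided by document length, multiplied by ln(N / df)) are all assumptions chosen for illustration, and the optional stemming or lemmatization step is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the abstract's pipeline; names and formulas are assumptions.
final class DocumentSimilaritySketch {

    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "in", "is", "of", "on", "the", "to");

    // Normalize to lowercase, tokenize on non-letter characters, drop stop words.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build one sparse TF-IDF vector per document.
    // Assumed variant: tf = count / document length, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1.0, Double::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Double> e : counts.entrySet()) {
                double tf = e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), tf * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(preprocess("The cat sat on the mat"));
        docs.add(preprocess("A cat sat on a mat"));
        docs.add(preprocess("Stock markets fell sharply"));
        List<Map<String, Double>> vectors = tfIdf(docs);
        System.out.println("doc0 vs doc1: " + cosine(vectors.get(0), vectors.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(vectors.get(0), vectors.get(2)));
    }
}
```

In this sketch the first two sample documents reduce to the same tokens after stop-word removal, so their similarity is approximately 1, while the third shares no terms with them and scores 0. One consequence of the assumed ln(N / df) weighting is that a term appearing in every document receives weight 0, so very small collections may need a smoothed IDF variant.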
Keywords :
Natural Language Processing, Document Similarity, TF-IDF, Cosine Similarity, Java, Text Mining, Information Retrieval, Plagiarism Detection.