Authors :
Hasan Mahmud; Mahmud Hasan; Farhana Ryhan Kabir; Md. Zahiruddin Aqib
Volume/Issue :
Volume 9 - 2024, Issue 10 - October
Google Scholar :
https://tinyurl.com/3y558fee
Scribd :
https://tinyurl.com/3tpz5bdh
DOI :
https://doi.org/10.5281/zenodo.14730649
Abstract :
Similarity analysis of words, phrases, and texts is a core task in natural language processing (NLP) and can be carried out at
the lexical level or the semantic level. Because Bangla is a low-resource language, the task is harder for Bangla than for
well-resourced languages. Several families of methods are used to measure similarity based on meaning, and semantic similarity
analysis is more difficult than lexical similarity analysis. This study primarily addresses the theoretical aspects of semantic
similarity analysis. A small number of approaches have been investigated and found effective at identifying similarity in Bangla NLP.
The Bangla WordNet corpus is not generally available to work with. Research on Bangla similarity to date has relied on
WordNet, LDA, LSA, Word2Vec, Doc2Vec, and WMD. We review these techniques and present a comparative study of them
in this paper.
Keywords :
AI; WordNet; Jaccard Similarity; Semantic Textual Similarity; Statistical Similarity; Cosine Similarity; N-gram; Natural Language Processing; Character-based Similarity; Term-based Similarity.
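The keywords name term-based and character-based measures (Jaccard similarity, cosine similarity, N-gram). As a minimal sketch of what such lexical measures compute, and not the authors' implementation, the Python below scores two short Bangla sentences. The function names, the whitespace tokenization, and the choice of character bigrams are illustrative assumptions; the n-grams are taken over Unicode code points, which can split a Bangla consonant from its vowel sign.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping n-grams over Unicode code points (assumption:
    no grapheme-cluster segmentation is applied)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a: str, b: str) -> float:
    """Term-based Jaccard similarity: |A & B| / |A | B| over word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine_ngrams(a: str, b: str, n: int = 2) -> float:
    """Character-based cosine similarity over n-gram count vectors."""
    va, vb = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


s1 = "আমি ভাত খাই"      # "I eat rice"
s2 = "আমি ভাত খাচ্ছি"    # "I am eating rice"

print(jaccard(s1, s2))        # 0.5: two shared words out of four distinct words
print(cosine_ngrams(s1, s2))  # between 0 and 1; higher means more shared bigrams
```

Both scores depend only on surface overlap, so two sentences with the same meaning but different wording would score low. That gap is what the semantic approaches surveyed in the paper (WordNet, LDA, LSA, Word2Vec, Doc2Vec, WMD) aim to close.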
References :
- Sheikh Abujar, Mahmudul Hasan, and Syed Akhter Hossain. 2019. Sentence similarity estimation for text summarization using deep learning. In Proceedings of the 2nd International Conference on Data Engineering and Communication Technology. Springer, 155–164.
- Munshi Asadullah. 2007. Finite state recognizer and string similarity based spelling checker for Bangla. Ph.D. Dissertation. BRAC University.
- Jeffrey Hays. [n.d.]. BENGALIS. http://factsanddetails.com/india/Minorities_Castes_and_Regions_in_India/sub7_4b/entry-4198.html
- Mustakim Al Helal. 2018. Topic Modelling and Sentiment Analysis with the Bangla Language: A Deep Learning Approach Combined with the Latent Dirichlet Allocation. Ph.D. Dissertation. Faculty of Graduate Studies and Research, University of Regina.
- Sabir Ismail and M Shahidur Rahman. 2014. Bangla word clustering based on N-gram language model. In 2014 International Conference on Electrical Engineering and Information & Communication Technology. IEEE, 1–5.
- Mohammad Shibli Kaysar and Mohammad Ibrahim Khan. 2018. Word sense disambiguation for bangla words using apriori algorithm. In International Conference on Recent Advances in Mathematical and Physical Sciences. 61.
- Yuhua Li, David McLean, Zuhair A Bandar, James D O’Shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering 18, 8 (2006), 1138–1150.
- Goutam Majumder, Partha Pakray, Alexander Gelbukh, and David Pinto. 2016. Semantic textual similarity methods, tools, and applications: A survey. Computación y Sistemas 20, 4 (2016), 647–665.
- Prianka Mandal and BM Mainul Hossain. 2017. A systematic literature review on spell checkers for bangla language. International Journal of Modern Education and Computer Science 9, 6 (2017), 40.
- Abu Kaisar Mohammad Masum, Sheikh Abujar, Raja Tariqul Hasan Tusher, Fahad Faisal, and Syed Akhter Hossain. 2019. Sentence Similarity Measurement for Bengali Abstractive Text Summarization. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 1–5. https://doi.org/10.1109/ICCCNT45670.2019.8944571
- Rabindra Nandi, M. Zaman, Tareq Muntasir, Sakhawat Sumit, and Md. Jamil-Ur Rahman. 2018. Bangla News Recommendation Using doc2vec. https://doi.org/10.1109/ICBSLP.2018.8554679
- Rajat Pandit, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, and Mohini Mohan Sardar. 2019. Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language. In Informatics, Vol. 6. Multidisciplinary Digital Publishing Institute, 19.
- Dwijen Rudrapal, Amitava Das, and Baby Bhattacharya. 2015. Measuring semantic similarity for bengali tweets using wordnet. In Proceedings of the International Conference Recent Advances in Natural Language Processing. 537–544.
- Nafiz Sadman, Akib Sadmanee, Md Tanveer, Md Ashraful Amin, and Amin Ali. 2019. Intrinsic Evaluation of Bangla Word Embeddings. 1–5. https://doi.org/10.1109/ICBSLP47725.2019.201506
- Md Shajalal and Masaki Aono. 2018. Semantic textual similarity in bengali text. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, 1–5.
- Manjira Sinha, Abhik Jana, Tirthankar Dasgupta, and Anupam Basu. 2012. A new semantic lexicon and similarity measure in bangla. In Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon. 171–182.