A Systematic Literature Review of Similarity Analysis Techniques for Bangla Text


Authors : Hasan Mahmud; Mahmud Hasan; Farhana Ryhan Kabir; Md. Zahiruddin Aqib

Volume/Issue : Volume 9 - 2024, Issue 10 - October


Google Scholar : https://tinyurl.com/3y558fee

Scribd : https://tinyurl.com/3tpz5bdh

DOI : https://doi.org/10.5281/zenodo.14730649


Abstract : Similarity analysis of words, phrases, and texts is a core task in natural language processing (NLP), and it can be carried out at the lexical level or the semantic level. Because Bangla is a low-resource language, the task is harder for Bangla than for well-resourced languages. Several types of methods are used to measure similarity based on meaning, and semantic similarity analysis is more difficult than lexical similarity analysis. This study primarily addresses the theoretical aspects of semantic similarity analysis. Only a small number of approaches have been investigated and found effective for identifying similarity in Bangla NLP research, and the Bangla WordNet corpus is not generally available. Research on Bangla similarity conducted so far has relied on WordNet, LDA, LSA, Word2Vec, Doc2Vec, and WMD. We review all of these techniques and present a comparative study of them in this paper.

Keywords : AI; WordNet; Jaccard Similarity; Semantic Textual Similarity; Statistical Similarity; Cosine Similarity; N-gram; Natural Language Processing; Character-based Similarity; Term-based Similarity.
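
Illustration : The term-based and character-based measures named in the keywords (Jaccard similarity, cosine similarity, N-gram) can be sketched in a few lines. The snippet below is a minimal sketch and is not taken from the paper: the function names, the whitespace tokenization, and the two example sentences are assumptions made for illustration. It covers only lexical overlap, which is why two sentences with different meanings can still score fairly high; the semantic techniques reviewed in the paper are meant to address that gap.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text` with whitespace removed.

    Assumption: n-grams are taken over Unicode code points, so Bangla
    combining vowel signs are counted as separate characters.
    """
    s = "".join(text.split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences: "I eat rice" and "I eat bread".
s1 = "আমি ভাত খাই"
s2 = "আমি রুটি খাই"

t1, t2 = s1.split(), s2.split()          # naive whitespace tokenization
print(jaccard(set(t1), set(t2)))         # term-based Jaccard
print(cosine(t1, t2))                    # term-based cosine
print(jaccard(char_ngrams(s1), char_ngrams(s2)))  # character bigram Jaccard
```

The two sentences share two of their three words, so the term-based Jaccard score is 2/4 and the cosine score is 2/3, even though the sentences describe different foods. A production system would also need Bangla-aware tokenization and normalization, which this sketch leaves out.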

References :

  1. Sheikh Abujar, Mahmudul Hasan, and Syed Akhter Hossain. 2019. Sentence similarity estimation for text summarization using deep learning. In Proceedings of the 2nd International Conference on Data Engineering and Communication Technology. Springer, 155–164.
  2. Munshi Asadullah. 2007. Finite state recognizer and string similarity based spelling checker for Bangla. Ph.D. Dissertation. BRAC University.
  3. Jeffrey Hays. [n.d.]. BENGALIS.  http://factsanddetails.com/india/Minorities_Castes_and_Regions_in_India/sub7_4b/entry-4198.html
  4. Mustakim Al Helal. 2018. Topic Modelling and Sentiment Analysis with the Bangla Language: A Deep Learning Approach Combined with the Latent Dirichlet Allocation. Ph.D. Dissertation. Faculty of Graduate Studies and Research, University of Regina.
  5. Sabir Ismail and M Shahidur Rahman. 2014. Bangla word clustering based on N-gram language model. In 2014 International Conference on Electrical Engineering and Information & Communication Technology. IEEE, 1–5.
  6. Mohammad Shibli Kaysar and Mohammad Ibrahim Khan. 2018. Word sense disambiguation for bangla words using apriori algorithm. In International Conference on Recent Advances in Mathematical and Physical Sciences. 61.
  7. Yuhua Li, David McLean, Zuhair A Bandar, James D O’shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE transactions on knowledge and data engineering 18, 8 (2006), 1138–1150.
  8. Goutam Majumder, Partha Pakray, Alexander Gelbukh, and David Pinto. 2016. Semantic textual similarity methods, tools, and applications: A survey. Computación y Sistemas 20, 4 (2016), 647–665.
  9. Prianka Mandal and BM Mainul Hossain. 2017. A systematic literature review on spell checkers for bangla language. International Journal of Modern Education and Computer Science 9, 6 (2017), 40.
  10. Abu Kaisar Mohammad Masum, Sheikh Abujar, Raja Tariqul Hasan Tusher, Fahad Faisal, and Syed Akhter Hossain. 2019. Sentence Similarity Measurement for Bengali Abstractive Text Summarization. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 1–5.
  11. Abu Mohammad Masum, Sheikh Abujar, and Syed Hossain. 2019. Sentence Similarity Measurement for Bengali Abstractive Text Summarization. https://doi.org/10.1109/ICCCNT45670.2019.8944571
  12. Rabindra Nandi, M. Zaman, Tareq Muntasir, Sakhawat Sumit, and Md. Jamil-Ur Rahman. 2018. Bangla News Recommendation Using doc2vec. https://doi.org/10.1109/ICBSLP.2018.8554679
  13. Rajat Pandit, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, and Mohini Mohan Sardar. 2019. Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language. In Informatics, Vol. 6. Multidisciplinary Digital Publishing Institute, 19.
  14. Dwijen Rudrapal, Amitava Das, and Baby Bhattacharya. 2015. Measuring semantic similarity for bengali tweets using wordnet. In Proceedings of the International Conference Recent Advances in Natural Language Processing. 537–544.
  15. Nafiz Sadman, Akib Sadmanee, Md Tanveer, Md Ashraful Amin, and Amin Ali. 2019. Intrinsic Evaluation of Bangla Word Embeddings. 1–5. https://doi.org/10.1109/ICBSLP47725.2019.201506
  16. Md Shajalal and Masaki Aono. 2018. Semantic textual similarity in bengali text. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, 1–5.
  17. Manjira Sinha, Abhik Jana, Tirthankar Dasgupta, and Anupam Basu. 2012. A new semantic lexicon and similarity measure in bangla. In Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon. 171–182.

