Authors :
Md. Abu Horaira Sarder
Volume/Issue :
Volume 10 - 2025, Issue 12 - December
Google Scholar :
https://tinyurl.com/mrj7ypb9
Scribd :
https://tinyurl.com/3wdczmts
DOI :
https://doi.org/10.38124/ijisrt/25dec1243
Abstract :
Bangladesh and its adjoining regions exhibit extensive linguistic diversity, comprising numerous regional languages
and dialects that remain underrepresented in digital communication systems. The absence of standardized translation
frameworks for these regional varieties poses substantial barriers to information accessibility, knowledge dissemination, and
inclusive technological development. This study proposes an NLP-based computational model for systematically translating
regional languages into Standard Bangla, thereby addressing the linguistic gap between informal spoken varieties and formal
written Bangla. The research methodology encompasses corpus development, data annotation, text normalization, tokenization,
phonological mapping, and the application of machine-learning and sequence-to-sequence translation architectures. A parallel
dataset consisting of region-specific lexical items, syntactic structures, and semantic patterns was constructed to train and
evaluate the system. Experimental evaluation indicates that the proposed model achieves promising translation accuracy while
preserving semantic integrity and contextual meaning. The findings highlight the system's potential to support language
standardization, promote linguistic inclusivity, and facilitate broader digital participation among speakers of marginalized
dialects. The study further advances localized NLP research in Bangladesh and provides a foundation for future extensions to
educational technology, governmental communication platforms, and multilingual AI systems.
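The preprocessing stages named in the abstract (text normalization, tokenization, and a lexical mapping step) can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names and the `LEXICON` table are hypothetical, the lexicon entries are placeholders rather than real dialect data, and a plain dictionary lookup stands in for the sequence-to-sequence model, which is not shown.

```python
import re
import unicodedata
from typing import Dict, List

# Hypothetical lookup table. The entries are placeholders, not real
# dialect or Standard Bangla words.
LEXICON: Dict[str, str] = {
    "dialect_a": "standard_a",
    "dialect_b": "standard_b",
}

# A token is either a run of non-space, non-punctuation characters or a
# single punctuation mark. The Bangla danda (U+0964) is treated as punctuation.
TOKEN_RE = re.compile(r"[^\s।,.!?]+|[।,.!?]")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split normalized text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def map_tokens(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace each token found in the lexicon; leave other tokens unchanged."""
    return [lexicon.get(token, token) for token in tokens]


def translate_baseline(text: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Word-by-word baseline: normalize, tokenize, then map through the lexicon."""
    return " ".join(map_tokens(tokenize(normalize(text)), lexicon))
```

A word-by-word lookup of this kind cannot handle reordering or context-dependent forms, which is one reason a learned sequence-to-sequence model is the natural next stage after preprocessing.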