International Journal of Innovative Science and Research Technology

Harmonizing Multilingual Product Data Using Machine Learning: A Case Study of the Rwanda Revenue Authority

Authors : Raymond Kamana

Volume/Issue : Volume 10 - 2025, Issue 8 - August

Google Scholar : https://tinyurl.com/bdz2csf7

Scribd : https://tinyurl.com/bdhwjecd

DOI : https://doi.org/10.38124/ijisrt/25aug118

Abstract : This study addresses inconsistent, multilingual product names in the Rwanda Revenue Authority’s (RRA) Electronic Billing Machine (EBM) system. Because product names are entered manually, spelling variants and translations of the same item make tax data hard to track and analyze. The study applies Natural Language Processing (NLP) and Machine Learning (ML) to clean and group similar product names. A total of 4.1 million records from 2020 to 2022 were translated into English and processed: MiniLM sentence embeddings captured the meaning of each name, UMAP reduced the dimensionality of those embeddings, and HDBSCAN grouped the reduced vectors into clusters. The cleaned and grouped product names make it easier to detect possible fraud, spot underpricing, and improve the accuracy of tax reporting, which helps RRA improve data quality and tax compliance.
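The embed, reduce, and cluster steps named in the abstract could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the author's implementation: the `all-MiniLM-L6-v2` checkpoint, the `sentence-transformers`, `umap-learn`, and `hdbscan` packages, every parameter value, and the helper names `normalize_name` and `cluster_names` are choices made here for illustration. The translation step is omitted, so the input is assumed to be already in English.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Light cleaning (hypothetical helper): lowercase, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", raw).lower()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def cluster_names(names, min_cluster_size=5, n_components=5, n_neighbors=15):
    """Embed, reduce, and cluster product names; returns (name, label) pairs.

    A label of -1 is HDBSCAN's marker for noise (no cluster assigned).
    All parameter defaults are assumptions, not values from the paper.
    """
    # Heavy dependencies are imported lazily so the cleaning helper
    # stays usable without them installed.
    from sentence_transformers import SentenceTransformer
    import hdbscan
    import umap

    cleaned = [normalize_name(n) for n in names]

    # Assumed checkpoint: the paper says "MiniLM" without naming a model.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(cleaned, show_progress_bar=False)

    # Reduce the embeddings to a few dimensions before density clustering.
    reducer = umap.UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
        metric="cosine",
        random_state=42,
    )
    reduced = reducer.fit_transform(embeddings)

    # Density-based clustering; the number of clusters is not fixed in advance.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return list(zip(names, labels.tolist()))
```

Calling `cluster_names(product_names)` on a list of a few hundred or more names would return each original name paired with a cluster label. Very small inputs may not give UMAP enough neighbors to work with, so the defaults above would need tuning for any real dataset.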

Keywords : Multilingual Harmonization, Natural Language Processing (NLP), Machine Learning (ML), Product Name Clustering, Sentence Embedding, MiniLM, HDBSCAN, KMeans, UMAP, Language Detection, Tax Data Quality, Rwanda Revenue Authority (RRA), Electronic Billing Machine (EBM), Anomaly Detection, Cross-lingual Processing.

