Authors :
Raymond Kamana
Volume/Issue :
Volume 10 - 2025, Issue 8 - August
Google Scholar :
https://tinyurl.com/bdz2csf7
Scribd :
https://tinyurl.com/bdhwjecd
DOI :
https://doi.org/10.38124/ijisrt/25aug118
Abstract :
This study addresses the problem of inconsistent, multilingual product names in the Rwanda Revenue
Authority’s (RRA) Electronic Billing Machine (EBM) system. Because product names are entered manually, spelling
variants and translations of the same product name make tax data hard to track and analyze. To address this, the study
applies Natural Language Processing (NLP) and Machine Learning (ML) to clean and group similar product names. A total of
4.1 million records from 2020 to 2022 were translated into English and processed. Each product name was encoded as a
MiniLM sentence embedding, reduced in dimensionality with UMAP, and clustered with HDBSCAN. The cleaned and grouped
product names make it easier to detect possible fraud, spot underpricing, and improve the accuracy of tax reporting,
which helps RRA improve data quality and tax compliance.
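
The pipeline summarized in the abstract (clean, embed with MiniLM, reduce with UMAP, cluster with HDBSCAN) might be sketched roughly as follows. This is an illustrative sketch, not the study's code: the function names, the `all-MiniLM-L6-v2` checkpoint, and every parameter value (`n_components`, `n_neighbors`, `min_cluster_size`) are assumptions chosen for the example, and the translation step is omitted. It assumes the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan` are installed.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def group_by_label(names, labels):
    """Collect names per cluster label; label -1 (HDBSCAN noise) is skipped."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        if label != -1:
            groups[int(label)].append(name)
    return dict(groups)


def cluster_product_names(names, min_cluster_size=5):
    """Hypothetical end-to-end helper: embed, reduce, cluster.

    Heavy third-party imports are kept inside the function so the pure-Python
    helpers above can be used without them. Intended for inputs much larger
    than n_neighbors; all parameter values are illustrative assumptions.
    """
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    cleaned = [normalize_name(n) for n in names]

    # Assumed MiniLM checkpoint; the abstract does not name a specific model.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(cleaned, show_progress_bar=False)

    # Reduce the embedding dimensionality before density-based clustering.
    reducer = umap.UMAP(n_components=5, n_neighbors=15,
                        metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # HDBSCAN assigns -1 to points it treats as noise.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return group_by_label(names, labels)
```

A call such as `cluster_product_names(list_of_translated_names)` would return a dictionary mapping each cluster label to the original names assigned to it. Names that HDBSCAN marks as noise are left out of the result and could be reviewed separately.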
Keywords :
Multilingual Harmonization, Natural Language Processing (NLP), Machine Learning (ML), Product Name Clustering, Sentence Embedding, MiniLM, HDBSCAN, KMeans, UMAP, Language Detection, Tax Data Quality, Rwanda Revenue Authority (RRA), Electronic Billing Machine (EBM), Anomaly Detection, Cross-lingual Processing.
References :
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, 4171–4186. https://doi.org/10.48550/arXiv.1810.04805
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.48550/arXiv.1908.10084
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://doi.org/10.48550/arXiv.1802.03426
- Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining (pp. 160–172). Springer. https://doi.org/10.1007/978-3-642-37456-2_14
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141–188. https://doi.org/10.1613/jair.2934
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781. https://doi.org/10.48550/arXiv.1301.3781
- Camacho-Collados, J., & Pilehvar, M. T. (2018). From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. Journal of Artificial Intelligence Research, 63, 743–788. https://doi.org/10.1613/jair.1.11259