Validating SFinDSet: A High-Quality Synthetic Dataset for Financial Fraud Detection


Authors : Muhammad Nuraddeen Ado; Shafi’i Muhammad Abdulhamid; Idris Ismaila

Volume/Issue : Volume 11 - 2026, Issue 1 - January


Google Scholar : https://tinyurl.com/3t49c2pp

Scribd : https://tinyurl.com/mr5cepfz

DOI : https://doi.org/10.38124/ijisrt/26jan950



Abstract : Financial fraud remains a persistent and evolving threat, and detecting it effectively requires robust machine learning (ML) models. Access to real-world financial transaction data, however, is limited by privacy restrictions and regulatory concerns, which leaves a gap in fraud detection research. This study introduces SFinDSet, a synthetic financial transaction dataset designed to simulate real-world banking operations for fraud detection, money laundering prevention, and financial risk assessment. The dataset's reliability was assessed through exploratory data analysis (EDA) and validated using anomaly detection techniques. To benchmark its performance, SFinDSet was evaluated against two established datasets: BankDSet (a real-world financial dataset) and SynFraudDataset (a synthetic fraud dataset). Several methods were tested across these datasets: Systematic Detection (SyD), Random Forest (RF), Isolation Forest (IF), DBSCAN, support vector machines (SVM), and principal component analysis (PCA). SyD achieved 100% recall, meaning it detected every fraud case with no false negatives, whereas the traditional models exhibited high false negative rates. These findings support SFinDSet as a reliable benchmark dataset and highlight the role of synthetic financial datasets in advancing fraud detection research.

Keywords : Synthetic Financial Datasets, Fraud Detection, Machine Learning Models.
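
The recall and false negative figures quoted in the abstract follow from standard definitions: recall = TP / (TP + FN), and the false negative rate is 1 - recall. The sketch below is illustrative only. The function name `recall_and_fnr` and the toy labels are assumptions made for this example; they are not taken from SFinDSet or from the authors' evaluation code.

```python
def recall_and_fnr(y_true, y_pred):
    """Return (recall, false_negative_rate) for binary labels where 1 = fraud."""
    # True positives: fraud cases the detector flagged.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False negatives: fraud cases the detector missed.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positives = tp + fn
    if positives == 0:
        raise ValueError("no fraud cases in y_true; recall is undefined")
    recall = tp / positives
    return recall, 1.0 - recall


# Toy labels, invented for illustration (not drawn from any dataset in the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
catch_all = [1, 1, 1, 1, 0, 1, 1, 0]  # flags all four frauds, plus two false alarms
cautious = [1, 0, 0, 1, 0, 0, 0, 0]   # misses two of the four frauds

print(recall_and_fnr(y_true, catch_all))  # (1.0, 0.0)
print(recall_and_fnr(y_true, cautious))   # (0.5, 0.5)
```

Recall measures only missed fraud. The `catch_all` detector reaches 100% recall despite raising two false alarms, so recall is usually read alongside precision or the false positive rate.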


