Detecting Malicious URLs: A Machine Learning Approach using Feature Engineering and Ensemble Models


Authors : Saeed Hubairik Aliyu; Naeem Naseer; Bilal Muhammad

Volume/Issue : Volume 10 - 2025, Issue 5 - May


Google Scholar : https://tinyurl.com/49t5wf4d

DOI : https://doi.org/10.38124/ijisrt/25may1412



Abstract : This work examines the use of machine learning to classify URLs into four categories: benign, defacement, phishing, and malware. The dataset contains 651,191 URLs: 428,103 benign, 96,457 defacement, 94,111 phishing, and 32,520 malware. Three machine learning models were compared: a CatBoost classifier, a Snapshot Ensemble, and a Stacked Ensemble with Snapshots. The CatBoost classifier reached an accuracy of about 96%, with precision ranging from 91% to 97% and recall from 82% to 99%, indicating that it handled class imbalance reasonably well. The Snapshot Ensemble achieved an accuracy of about 95.83%, performing well on the classification task while managing model complexity and generalization effectively. The Stacked Ensemble with Snapshots yielded a lower accuracy of 91.30%, with high variability in performance across the classes. These results illustrate the value of ensemble techniques for improving classification performance and addressing class imbalance. Future research should refine feature engineering techniques and real-time detection capabilities while maintaining high ethical standards in the use of public, readily available data, further contributing to the development of URL classification and thus to cybersecurity as a whole.

Keywords : URL Classification, Machine Learning, Cybersecurity, Ensemble Techniques, Phishing Detection, Malware Detection, Class Imbalance, Feature Engineering, Real-Time Detection, Precision and Recall.
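
To make the class imbalance mentioned in the abstract concrete, the following minimal sketch computes each class's share of the 651,191 URLs and a set of inverse-frequency class weights. The class counts come from the abstract; the helper names and the weighting formula (total / (number of classes × class count)) are illustrative assumptions, not the authors' stated method.

```python
# Class counts as reported in the abstract.
CLASS_COUNTS = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    This is a common heuristic for imbalanced data; it is an assumption
    here and is not taken from the paper.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(CLASS_COUNTS)
    weights = balanced_class_weights(CLASS_COUNTS)
    for label in CLASS_COUNTS:
        print(f"{label:<11} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Under this heuristic the rarest class (malware) receives the largest weight and the majority class (benign) the smallest, which is one way a classifier can be discouraged from simply favoring benign URLs.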


