Authors :
Saeed Hubairik Aliyu; Naeem Naseer; Bilal Muhammad
Volume/Issue :
Volume 10 - 2025, Issue 5 - May
Google Scholar :
https://tinyurl.com/49t5wf4d
DOI :
https://doi.org/10.38124/ijisrt/25may1412
Abstract :
This work examines the use of machine learning to classify URLs into four categories: benign, defacement,
phishing, and malware. The dataset contains 651,191 URLs: 428,103 benign, 96,457 defacement, 94,111 phishing,
and 32,520 malware. Three machine learning models were compared: a CatBoost classifier, a Snapshot Ensemble,
and a Stacked Ensemble with Snapshots. The CatBoost classifier reached an accuracy of about 96%, with
precision ranging from 91% to 97% and recall from 82% to 99%, indicating that it handled class imbalance
reasonably well. The Snapshot Ensemble reached an accuracy of about 95.83%, performing well on the
classification task while handling model complexity and generalization effectively. The Stacked Ensemble with
Snapshots had a lower accuracy of 91.30% and showed high variability in performance across the classes. These
results show the value of ensemble techniques for improving classification performance and addressing
class imbalance. Future research should refine feature engineering techniques and real-time detection
capabilities while maintaining high ethical standards in the use of public, readily available data, further
contributing to the development of URL classification and to cybersecurity as a whole.
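To make the class imbalance concrete, the short sketch below computes each class's share of the dataset from the counts reported in the abstract, along with inverse-frequency class weights. The weighting formula, total / (number of classes * class count), is a common heuristic and is an assumption here; the abstract does not say whether the authors used class weights. The function names are hypothetical and chosen only for illustration.

```python
# Class counts as reported in the abstract.
CLASS_COUNTS = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count).

    Assumption: a common heuristic, not a method stated in the abstract.
    Rarer classes receive larger weights.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(CLASS_COUNTS)
    weights = balanced_class_weights(CLASS_COUNTS)
    for label in CLASS_COUNTS:
        print(f"{label:<11} share={shares[label]:.3f} weight={weights[label]:.3f}")
```

Under this heuristic the malware class, the smallest of the four, receives the largest weight and the benign class the smallest, which is one way a classifier can be discouraged from favoring the majority class.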
Keywords :
URL Classification, Machine Learning, Cybersecurity, Ensemble Techniques, Phishing Detection, Malware Detection, Class Imbalance, Feature Engineering, Real-Time Detection, Precision and Recall.