Comparative Analysis of TinyBERT, SVM, and Char-CNN Models for Phishing URL Detection

Haeranisa Bella Krisanti; Chaerul Umam

doi:10.32520/stmsi.v15i5.6345

Abstract

Phishing is among the most prevalent cybersecurity threats: attackers use malicious URLs to deceive users and steal sensitive information. This study proposes a URL-based phishing detection method using the lightweight Transformer model TinyBERT and compares its performance with three baseline models: an SVM on character n-grams, a Random Forest on lexical URL features, and a Char-CNN. The dataset consists of 49,750 URLs with multi-class labels (benign, defacement, malware, and phishing), which were binarized into phishing (label 1) and non-phishing (label 0). The data were divided by stratified split into training, validation, and test sets at a ratio of 70%–15%–15%. To address class imbalance, TinyBERT was trained with a loss function weighted by class weights. Evaluation used the confusion matrix, accuracy, precision, recall, and F1-score, together with ROC and Precision–Recall curves. TinyBERT achieved the best performance, with an accuracy of 0.9925, a phishing recall of 0.9512, and an F1-score of 0.9387. It also produced the fewest false negatives (22) of all the models compared. These findings indicate that TinyBERT is more effective at reducing the number of phishing URLs misclassified as non-phishing, which makes it the more suitable choice for URL-based phishing detection in cybersecurity systems.
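The data preparation and evaluation steps named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random seed, the inverse-frequency weight formula, and the toy label counts are all hypothetical, and the abstract does not specify how the class weights were actually computed.

```python
import random
from collections import Counter, defaultdict


def binarize(label):
    """Collapse the four source classes into phishing (1) vs. non-phishing (0)."""
    return 1 if label == "phishing" else 0


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def class_weights(labels):
    """Assumed formula: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived scores for the positive class (1)."""
    pairs = Counter(zip(y_true, y_pred))
    tp, fn = pairs[(1, 1)], pairs[(1, 0)]
    fp, tn = pairs[(0, 1)], pairs[(0, 0)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, NOT the study's data.
raw = ["benign"] * 700 + ["defacement"] * 150 + ["malware"] * 90 + ["phishing"] * 60
y = [binarize(label) for label in raw]

train_idx, val_idx, test_idx = stratified_split(y)
weights = class_weights([y[i] for i in train_idx])  # computed on training data only

print(len(train_idx), len(val_idx), len(test_idx))
print(weights)
```

In a weighted cross-entropy loss, each sample's loss term would be multiplied by the weight of its true class, so errors on the rarer phishing class cost more during training. Recall on the phishing class is the score most directly tied to the false-negative count the abstract reports, since recall = TP / (TP + FN).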

Keywords

Char-CNN; phishing; SVM; TinyBERT; URL phishing detection; Random Forest

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
