A phishing website, also known as a spoofed site, is a website that impersonates an original website or disguises itself as a legitimate one in order to cheat users out of personal information, passwords, credit card credentials, or other financial details. Users without cybersecurity awareness are easily deceived by phishing websites created by malicious actors. To distinguish legitimate websites from malicious ones, machine learning techniques have been used to detect phishing websites by analysing their URLs.
This report proposes a machine-learning-based phishing website detection system that determines whether a URL is legitimate or malicious. It is a supervised learning model that takes a lexical approach: the model analyses features of the URL such as the number of dots, the number of delimiters, the presence of IP addresses, and so forth. The dataset for training and testing the models is taken from Aalto University and contains a total of 96,018 URLs, half of them legitimate and the rest malicious. However, due to computational limitations (RAM capacity) and available resources, only 6,251 legitimate and 6,251 malicious URLs were used.
Three algorithms were used as classifiers: Decision Tree, AdaBoost, and KNN. Overall, the Decision Tree classifier achieved the best result of the three, although only by a small margin. Transfer learning was then performed with a further 6,500 legitimate and malicious URLs, which improved the accuracy from 81% to 83.7%.
Amid the COVID-19 pandemic, phishing attacks in Singapore doubled compared to 2019 (Baharudin, H., 2020). Malicious actors created phishing websites that impersonated official Singapore Government websites and sent the URLs to targets via email and SMS, claiming that the government was providing financial support and that users had to enter their bank account and login credentials to receive it (Singapore Govt., 2020). In addition, the US Federal Trade Commission reported that 18 million USD had been lost across 12,000 fraud cases related to COVID-19 (Singcert, 2020). Therefore, there is a need for a system that deals with phishing URLs in order to protect online users.
Phishing via fraudulent websites is not a new technique. There are several ways to prevent or detect phishing websites; one of the most common methods nowadays is machine learning, owing to its scalability and adaptability (Oumaima El Kouari, Hafssa Benaboud, and Saiida Lazaar, 2020). In recent years, many algorithms have been proposed for detecting phishing URLs from lexical features.
In the paper published by Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020), the authors proposed a learning-based model to detect phishing activity via URLs. They grouped the features of phishing websites into three categories: i) lexical features, ii) NLP features, and iii) host-based features. Lexical features are the most common means of detecting phishing websites. For example, malicious actors try to avoid arousing users' suspicion by using a long domain name, or special characters that tend to appear in phishing URLs, such as ‘?’, ‘//’, ‘.’, and so forth. NLP features look for random words, keywords, and brand names in phishing URLs. Host-based features look up domain properties via the WHOIS service, such as the creation date, expiration date, and registrar name; these are informative because phishing URLs stay online only for a short time and often contain IP addresses in their host names. All three categories of features were used to detect phishing websites. Moreover, they used a wide range of machine learning algorithms, namely KNN (K-Nearest Neighbour), LR (Logistic Regression), SVM (Support Vector Machine), GBC (Gradient Boosting Classifier), ABC (Ada Boost Classifier), and RFC (Random Forest Classifier), to train the model on a dataset containing 36,400 legitimate and 37,175 phishing URLs. Overall, ABC achieved the best result, with 94% accuracy on a testing set of 22,073 URLs. Therefore, the Ada Boost Classifier was chosen as one of the classifiers for this paper.
In the papers by L. Machado and J. Gadge (2017) and Vijaya, M.S. (2012), the authors proposed machine learning models based on lexical features with a DT (Decision Tree) classifier. The main differences between the two are the dataset used and the number of features. L. Machado and J. Gadge (2017) trained on 1,500 legitimate and 1,500 phishing URLs with 9 features, tested on another 3,000 URLs, and achieved 89.4% accuracy. Vijaya, M.S. (2012) used much smaller training and testing datasets than the others, containing only 100 legitimate and 100 phishing URLs with 17 features, and achieved 98.5% accuracy. This suggests that a small dataset with more features can give a very good result, while a larger dataset with fewer features gives a decent result.
In Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, and Banu Diri (2017), the authors evaluated several methods, including KNN, ABC, and DT. In terms of lexical features, the results show that KNN performed best with 83.01% accuracy, compared to 82.48% for DT and 74.74% for ABC. However, these results differ from those of Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020), even though both used the same methods and the same dataset, with the same number of legitimate and malicious URLs. Therefore, there is a need to experiment with the KNN, ABC, and DT classifiers on our dataset.
In addition, Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020) showed that DT achieved the best result of 97.29% accuracy, with ABC slightly lower at 95.93%, on the same dataset containing only 1,116 legitimate and 1,428 malicious URLs.
Lastly, we propose a machine-learning-based model with lexical features to detect phishing URLs, since the five papers report accuracies ranging from 74.74% to 98.5% with several different algorithms. The algorithms we propose are KNN, ABC, and DT, since these three gave the best results in Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, and Banu Diri (2017). The purpose of this paper is therefore to find out which algorithm is best for detecting phishing URLs.
Approach and Implementation
The idea behind this project was to come up with a hybrid ML solution. For this purpose, we combined two popular ML approaches: first, feature extraction and classification, and second, transfer learning.
The following process flow diagram illustrates the whole concept of this project.
Screenshot 1 Data Flow Diagram
The process starts from the pre-processed dataset. This dataset consists of URLs labelled 0 or 1 (where 0 indicates a legitimate URL and 1 indicates a phishing URL). Once the dataset is loaded, we specify the features to be extracted.
The image above shows how the feature extraction works. Following this example, for our project we extracted the features listed in the following table:
Number of dots (.) in subdomain
Hyphen in domain (-)
At in domain (@)
Double slash (//)
Whether IP is used as URL
The following screenshot shows a few lines of the code used to extract some of the features.
Screenshot 2 Code for feature extraction
Once the features are extracted, we put them into a feature set that is used in later steps of the program.
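The five lexical features listed above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names `host_is_ip` and `extract_features`, the dictionary keys, and the rule that treats everything to the left of the last two host labels as the subdomain are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host: str) -> bool:
    """Return True when the host part of a URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute five lexical features for one URL (hypothetical helper)."""
    # urlparse only recognises the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    is_ip = host_is_ip(host)

    # Assumption: everything left of the last two labels is the subdomain.
    labels = host.split(".")
    subdomain = "" if is_ip else ".".join(labels[:-2])

    # Ignore the "//" that belongs to the scheme separator itself.
    rest = url.split("://", 1)[-1]

    return {
        "dots_in_subdomain": subdomain.count("."),
        "hyphen_in_domain": int("-" in host),
        "at_in_domain": int("@" in parsed.netloc),
        "double_slash": int("//" in rest),
        "ip_as_host": int(is_ip),
    }


features = extract_features("http://user@my-bank.example.com/login")
print(features)
```

Each URL becomes one row of numeric values, which is the form a classifier expects as input.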
For this program we use three classifiers, namely Decision Tree, AdaBoost, and KNN. After the first test results, we performed tuning with GridSearchCV to see how much it improves performance. The following table shows the results before and after tuning.
The performance results show slight variations before and after tuning. For Decision Tree and KNN, the results after tuning are higher than before, whereas for AdaBoost the results after tuning are slightly lower than before.
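A before-and-after comparison of this kind can be sketched with scikit-learn as shown below. The data here is synthetic (generated with `make_classification`) and only stands in for the extracted URL feature set; the parameter grids are illustrative assumptions, not the grids used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature set (assumption, not the real data).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative parameter grids (assumptions).
candidates = {
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    ),
    "AdaBoost": (
        AdaBoostClassifier(random_state=0),
        {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]},
    ),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Accuracy with default hyperparameters ("before tuning").
    before = model.fit(X_train, y_train).score(X_test, y_test)
    # 5-fold cross-validated grid search on the training split only.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Accuracy of the best estimator found ("after tuning").
    after = search.score(X_test, y_test)
    results[name] = (before, after, search.best_params_)
    print(f"{name}: before={before:.3f} after={after:.3f} best={search.best_params_}")
```

Because the grid search selects parameters by cross-validation on the training split, the tuned model is not guaranteed to score higher on the held-out split, which is consistent with the slightly lower AdaBoost result reported above.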
Finalizing the classifier
Since the Decision Tree results are somewhat better than those of the other two classifiers, we selected it as our final classifier.
Applying transfer learning
After finalising the classifier, we transferred the Decision Tree feature set to our test dataset for validation.
Screenshot 3 Transfer learning to test dataset
Screenshot 3 shows the accuracy of the Decision Tree classifier on our new test dataset. The accuracy on the new dataset is higher than in the previous results, which indicates better performance.
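The step of reusing a fitted Decision Tree on a separate dataset can be illustrated roughly as follows. This sketch only fits a tree on one set and scores it on another; it uses synthetic data as a stand-in for the two URL datasets, and the variable names and `max_depth` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): one pool of samples split into the
# original training data and a separate, previously unseen dataset.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X_first, y_first = X[:400], y[:400]
X_second, y_second = X[400:], y[400:]

# Fit the chosen classifier on the first dataset only.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_first, y_first)

# Reuse the fitted model, unchanged, on the second dataset.
second_accuracy = accuracy_score(y_second, tree.predict(X_second))
print(f"Accuracy on the second dataset: {second_accuracy:.3f}")

# Classify a single sample: 0 means legitimate, 1 means phishing.
single_prediction = int(tree.predict(X_second[:1])[0])
print(f"Prediction for one sample: {single_prediction}")
```

Both datasets must be passed through the same feature extraction so that the columns seen at prediction time match those seen during fitting.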
Screenshot 4 Testing with Phishing URL
Screenshot 5 Testing with Legitimate URL
Screenshots 4 and 5 show the solution being tested with user-provided input. For Screenshot 4, we supplied a URL taken from a phishing URL collection website called PhishTank (Phish Tank, N.d). As expected, the ML program outputs 1, which indicates that the given URL is a phishing URL. For Screenshot 5, we supplied the URL of our school's webpage, and the ML program outputs 0, which indicates that the given URL is legitimate.
The training and testing dataset was created in 2014 and is publicly available for download from Aalto University. It is a simple CSV file that contains 96,018 URLs, of which 48,009 are legitimate and the remaining 48,009 are phishing URLs. It has a single label whose value is 0 or 1: 0 represents a legitimate URL and 1 represents a phishing URL. The screenshot below shows the amount of labelled data.
Screenshot 6 The amount of labelled data
Of the 96,018 URLs, only 13,749 were picked as training data due to the computational limitations of the available hardware. This subset consists of 6,251 legitimate URLs and 7,498 phishing URLs. To reduce bias and improve accuracy, the two classes were balanced to 6,251 URLs each. The plot below shows the balanced classes.
Screenshot 7 Dataset classification
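One common way to balance two classes is random undersampling of the larger class. The sketch below assumes that approach (the report does not state how the balancing was done) and uses placeholder URLs with the class sizes reported above; `balance_by_undersampling` is a hypothetical helper.

```python
import random
from collections import Counter


def balance_by_undersampling(rows, seed=0):
    """Randomly drop rows so every class matches the smallest class size.

    `rows` is a list of (url, label) pairs; returns a shuffled balanced list.
    """
    by_label = {}
    for url, label in rows:
        by_label.setdefault(label, []).append((url, label))

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder URLs (assumption): 6,251 legitimate (0) and 7,498 phishing (1).
rows = [(f"legit-{i}.example", 0) for i in range(6251)]
rows += [(f"phish-{i}.example", 1) for i in range(7498)]

balanced = balance_by_undersampling(rows)
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])  # both classes end up with 6251 rows
```

Undersampling discards some phishing examples, so it trades a little training data for an even class split.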
Moreover, another 6,500 URLs from Phish Storm were downloaded for transfer learning: 3,250 legitimate URLs and 3,250 malicious URLs. The labels are the same as in the previous training and testing dataset. The table below shows the first 5 rows of the transfer learning dataset.
Screenshot 8 Dataset contents
In the papers by Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020), L. Machado and J. Gadge (2017), and Vijaya, M.S. (2012), the Decision Tree algorithm achieved the best result. In the experiment with our dataset, the results likewise show that Decision Tree is the best at classifying phishing URLs, regardless of the amount of data and the number of features. However, the performance of Decision Tree, Ada Boost Classifier, and K-Nearest Neighbour is relatively close, even before tuning with GridSearchCV. The comparison table in the previous section shows that the accuracy lead of Decision Tree is only about 1%.
Lastly, we reused the Decision Tree model to perform transfer learning. The results show that accuracy increased by around 3% compared to the previous result. Transfer learning thus optimises the process and improves performance at the same time. The final result and other metrics are shown in the screenshot below.
Screenshot 9 Decision Tree Classifier Final Result
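For reference, the standard definitions of accuracy, precision, recall, and F1 for a binary classifier can be computed as in the sketch below. The report does not list which additional metrics appear in the screenshot, so the choice of metrics here is an assumption, and the labels are a made-up toy example rather than project results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1, treating label 1 (phishing) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not results from the project).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.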
Hardware – i5-7200U, 4 GB DDR4 RAM, 20 GB HDD Storage.
Software – Anaconda Individual Edition 4.8.3 (Python 3.8), Jupyter Notebook.
Many enhancements can be built on top of this project. One main enhancement would be to include more classification algorithms beyond Decision Tree, AdaBoost, and KNN. Also, to make the project usable by the general public, a front end could be built with the Flask framework on top of our ML solution. With this, a website could be created where people submit a URL and are told whether it is phishing or legitimate.
Literature review, Documentation, Code debug
Literature review, Data Set Preparation, Documentation, Classifier Programming
Literature review, Documentation, Feature extraction programming.
In this paper, we illustrated the implementation concept of the project with a flow diagram. We combined two popular machine learning approaches: feature extraction with classification, and transfer learning. We applied three popular classifiers and tuned them using GridSearchCV. After comparing the results, Decision Tree was selected as the final classifier, and its feature set was transferred to our test dataset for validation. We described the performance evaluation, which includes the dataset we used for testing and a comparison of the chosen algorithm's performance with that of the other algorithms. We have learned that the dataset is very important in machine learning. Some phishing websites look very close to legitimate websites, and the properties of their URLs are configured to match those of legitimate websites. To detect most of them, we need to train the model with suitable features. Machines do not make decisions on their own, so we need to use the dataset, feature extraction, and algorithms in a proper way.