CSG2341 – Intelligent Systems

Added on - 29 Apr 2021

Implementation Report:
Phishing URL Detection
Machine Learning And Lexical Features Extraction

Table of Contents
Abstract
Introduction
  Related Works
Approach and Implementation
  Pre-processed Dataset
  Features extraction
  Applying classifiers
  Finalizing the classifier
Applying transfer learning
  Providing Input
Performance Evaluation
Specifications
Future Work
Contribution
Conclusion
References
Appendix

Abstract

A phishing website, also known as a spoofed site, is a website that attempts to impersonate an original website, or disguises itself as a legitimate one, in order to cheat users out of personal information, passwords or credit card credentials, or for other financial purposes. Users without cybersecurity awareness can easily be deceived by phishing websites created by malicious actors. To distinguish between legitimate and malicious websites, machine learning techniques have been used to detect phishing websites by analysing the website's URL.
A machine-learning-based phishing website detection system is proposed in this report to determine whether a URL is legitimate or malicious. It is a supervised learning model that uses a lexical approach: the model analyses features of the URL such as the number of dots, the number of delimiters, the presence of IP addresses and so forth. The dataset for training and testing the models is taken from Aalto University and contains a total of 96,018 URLs, half of which are legitimate and the rest malicious. However, due to computational limitations (RAM capacity) and available resources, only 6,251 legitimate and 6,251 malicious URLs have been used.

Moreover, three algorithms, Decision Tree, AdaBoost and KNN, are used as the classifiers. Overall, the Decision Tree classifier gets the best result of the three, although with a slightly low accuracy. Transfer learning was then performed with another 6,500 legitimate and malicious URLs from the same dataset, which improved the accuracy from 81% to 83.7%.

Introduction

Amid the COVID-19 pandemic, phishing attacks in Singapore have doubled compared to 2019 BAHARUDIN, H. (2020). Malicious actors create a phishing website that impersonates an official Singapore Government website and send the URL to targets via email and SMS, claiming that the government is providing financial support and that users must enter their bank account and login credentials in order to receive it Singapore Govt. (2020). In addition, the US Federal Trade Commission reports that 18 million USD has been lost across 12,000 fraud cases related to COVID-19 Singcert. (2020). Therefore, there is a need for a system that deals with phishing URLs in order to protect online users.

Phishing attacks via a phishing website are not a new technique.
There are several ways to prevent or detect phishing websites; one of the most common methods nowadays is machine learning, due to its scalability and adaptability Oumaima El Kouari, Hafssa Benaboud, and Saiida Lazaar. (2020). In the last few years, many algorithms have been proposed for phishing URL detection with lexical features.

Related Works

In the paper published by Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay. (2020), the authors proposed a learning-based model to detect phishing activities via URLs. They grouped the features of phishing websites into three categories: i) lexical features, ii) NLP features and iii) host-based features. Lexical features are the most common means of detecting phishing websites. For example, malicious actors try to avoid suspicion by using a long domain name, or special characters that tend to appear in phishing URLs such as '?', '//', '.' and so forth. NLP features look for random words, keywords and brand names in phishing URLs. Host-based features look up domain properties via the WHOIS service, such as the creation date, expiration date and registrar name, since phishing URLs only stay online for a short time and normally contain IP addresses in their host names. These three categories of features have been used to detect phishing websites. Moreover, the authors used a wide range of machine learning algorithms, such as KNN (K-Nearest Neighbour), LR (Logistic Regression), SVM (Support Vector Machine), GBC (Gradient Boosting Classifier), ABC (Ada Boost Classifier) and RFC (Random Forest Classifier), to train the model on a dataset containing 36,400 legitimate and 37,175 phishing URLs. Overall, ABC achieved the best result, 94% accuracy on a testing set of 22,073 URLs. Therefore, the Ada Boost Classifier has been chosen as one of the classifiers for this paper.

In the papers released by L. Machado and J. Gadge (2017) and Vijaya, M.S. (2012), both sets of authors proposed a machine learning model based on lexical features with a DT (Decision Tree) classifier. The main differences between L. Machado and J. Gadge (2017) and Vijaya, M.S. (2012) are the dataset used and the number of features. For L. Machado and J. Gadge (2017), the training dataset contains 1,500 legitimate and 1,500 phishing URLs with 9 features, and the testing dataset contains another 3,000 URLs. The result they achieve is 89.4% accuracy. For Vijaya, M.S. (2012), the training and testing datasets are smaller than any of the others, containing only 100 legitimate and 100 phishing URLs with 17 features.
The result they achieve is 98.5% accuracy. Therefore, a smaller dataset with more features gets a very good result, and a larger dataset with fewer features gets a decent result.

In Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri, (2017), the authors propose several methodologies, such as KNN, ABC and DT. In terms of lexical features, the results show that KNN gets the best result with 83.01% accuracy, compared to 74.74% for ABC and 82.48% for DT. However, these results are not the same as those of Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay. (2020), even though both used the same dataset, containing the same number of legitimate and malicious URLs, and the same methodologies. Therefore, there is a need to experiment with the KNN, ABC and DT classifiers on our dataset.

In addition, Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay. (2020) showed that DT gets the best result of 97.29% accuracy and ABC a slightly lower result of 95.93% accuracy with the same dataset that contains only 1116 legitimate and 1428 malicious URLs.

Lastly, we propose a machine-learning-based model with lexical features to detect phishing URLs, since the five papers report accuracies ranging from 74.74% to 98.5% with several different algorithms. The algorithms we propose are KNN, ABC and DT, since these three algorithms get the best results in Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri, (2017). Therefore, the purpose of this paper is to find out which algorithm is the best for detecting phishing URLs.
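
To make the lexical features discussed in this section more concrete (number of dots, number of delimiters, an IP address in the host name, special characters such as '?' and '//'), the following is a minimal Python sketch of how such features might be computed from a URL string alone. It is not the implementation used in this report: the function names, the feature names and the delimiter set in DELIMITERS are assumptions chosen for illustration, and only the Python standard library is used.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical delimiter set, chosen for illustration only.
DELIMITERS = "-_/?=&@%"


def host_is_ip(host: str) -> bool:
    """Return True when the host name is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a few lexical features from the URL string alone."""
    # Add a scheme when none is given so that urlsplit can find the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    # Part of the URL after the scheme separator, so that the "//" and
    # other characters of "http://" itself are not counted as features.
    rest = url.split("://", 1)[-1]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": rest.count("."),
        "num_delimiters": sum(rest.count(d) for d in DELIMITERS),
        "has_ip_host": host_is_ip(host),
        "has_query": "?" in rest,
        "extra_double_slash": "//" in rest,
    }
```

Calling lexical_features("http://192.168.0.1/login.php?user=a") returns a dictionary in which has_ip_host is set, while a URL such as "https://www.example.com/" leaves it unset. Each dictionary can be turned into a numeric feature vector and passed to a classifier such as KNN, ABC or DT.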