
CSG2341 – Intelligent Systems


Added on  2021-04-29

Implementation Report: Phishing URL Detection Machine Learning And Lexical Features Extraction

Table of Contents
Abstract
Introduction
  Related Works
Approach and Implementation
  Pre-processed Dataset
  Features extraction
  Applying classifiers
  Finalizing the classifier
Applying transfer learning
  Providing Input
Performance Evaluation
Specifications
Future Work
Contribution
Conclusion
References
Appendix

Abstract

A phishing website, also known as a spoofed site, is a website that impersonates an original website or disguises itself as a legitimate one in order to trick users into giving up personal information, passwords, credit card credentials or other financially valuable data. Users without cybersecurity awareness can easily be deceived by phishing websites created by malicious actors. To distinguish legitimate from malicious websites, machine learning techniques have been used to detect phishing websites by analysing their URLs.
A machine-learning-based phishing website detection system is proposed in this report to determine whether a URL is legitimate or malicious. It is a supervised learning model that uses a lexical approach: the model analyses features of the URL such as the number of dots, the number of delimiters, the presence of IP addresses and so forth. The dataset for training and testing the models is taken from Aalto University and contains a total of 96,018 URLs, half of them legitimate and the rest malicious. However, due to computational limitations (RAM capacity) and available resources, only 6251 legitimate and 6251 malicious URLs have been used.

Three algorithms are used as classifiers: Decision Tree, AdaBoost and KNN. Overall, the Decision Tree classifier gets the best result of the three, although its accuracy is still slightly low. Transfer learning is therefore performed with another 6500 legitimate and malicious URLs from the same dataset, which improves the accuracy from 81% to 83.7%.

Introduction

Amid the COVID-19 pandemic, phishing attacks in Singapore have doubled compared to 2019 (Baharudin, H., 2020). Malicious actors create a phishing website that impersonates an official Singapore Government website, send its URL to targets via email and SMS, and claim that the government is providing financial support for which users must enter their bank account details and login credentials (Singapore Govt., 2020). In addition, the US Federal Trade Commission reports that 18 million USD has been lost across 12,000 fraud cases related to COVID-19 (Singcert, 2020). Therefore, there is a need for a system that deals with phishing URLs in order to protect online users.

Phishing attacks via phishing websites are not a new technique.
There are several ways to prevent or detect phishing websites; one of the most common methods nowadays is machine learning, due to its scalability and adaptability (Oumaima El Kouari, Hafssa Benaboud, and Saiida Lazaar, 2020). In the last few years, many algorithms have been proposed for detecting phishing URLs from lexical features.

Related Works

In the paper published by Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020), the authors proposed a learning-based model to detect phishing activities via URLs. They grouped the features of phishing websites into three categories: i) lexical features, ii) NLP features and iii) host-based features. Lexical features are the most commonly used for detecting phishing websites. For example, malicious actors try to avoid arousing users' suspicion by using a long domain name, or special characters that tend to appear in phishing URLs such as '?', '//', '.' and so forth. NLP features look for random words, keywords and brand names in phishing websites. Host-based features examine the domain's properties via the WHOIS service, such as the creation date, expiration date and registrar name, since phishing URLs only stay online for a short time and normally contain IP addresses in their host names. These three categories of features were used to detect phishing websites. Moreover, the authors used a wide range of machine learning algorithms, namely KNN (K-Nearest Neighbour), LR (Logistic Regression), SVM (Support Vector Machine), GBC (Gradient Boosting Classifier), ABC (Ada Boost Classifier) and RFC (Random Forest Classifier), to train the model on a dataset containing 36,400 legitimate and 37,175 phishing URLs. Overall, ABC got the best result of 94% accuracy on a testing set of 22,073 URLs. Therefore, Ada Boost Classifier has been chosen as one of the classifiers for this paper.

In the papers by L. Machado and J. Gadge (2017) and Vijaya, M.S. (2012), both sets of authors proposed a machine learning model based on lexical features with a DT (Decision Tree) classifier. The main differences between the two are the dataset used and the number of features. For L. Machado and J. Gadge (2017), the training dataset contains 1500 legitimate and 1500 phishing URLs with 9 features, and the testing dataset contains another 3000 URLs. The result they achieve is 89.4% accuracy. For Vijaya, M.S. (2012), the training and testing datasets are smaller than in any of the other works, containing only 100 legitimate and 100 phishing URLs with 17 features.
The result they achieve is 98.5% accuracy. In these two studies, therefore, the smaller dataset with more features gets a very good result, while the larger dataset with fewer features gets a decent result.

In Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri (2017), the authors proposed several methodologies, such as KNN, ABC and DT. In terms of lexical features, the results show that KNN gets the best result of 83.01% accuracy, compared to 74.74% for ABC and 82.48% for DT. However, these results are not the same as those of Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020), even though both used the same dataset, with the same number of legitimate and malicious URLs, and the same methodologies. Therefore, there is a need to experiment with the KNN, ABC and DT classifiers on our dataset.

In addition, Surya Srikar Sirigineedi, Jayesh Soni, and Himanshu Upadhyay (2020) showed that DT gets the best result of 97.29% accuracy and ABC a slightly lower result of 95.93% accuracy with the same dataset, which contains only 1116 legitimate and 1428 malicious URLs.

Lastly, we propose a machine-learning-based model with lexical features to detect phishing URLs, since the five papers report accuracies ranging from 74.74% to 98.5% with several different algorithms. The algorithms we propose are KNN, ABC and DT, since these three algorithms get the best results in Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri (2017). Therefore, the purpose of this paper is to find out which algorithm is the best for detecting phishing URLs.
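
The lexical features discussed in this report (number of dots, number of delimiters, an IP address in the host name, special characters such as '?' and '//') can be computed from the URL string alone. The following is a minimal sketch of what such an extraction step might look like, using only the Python standard library. The function names, the feature names and the delimiter set are assumptions made for illustration; they are not taken from the report's implementation.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed delimiter set for illustration; the report does not list its own.
DELIMITERS = "-_?=&/"


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a few lexical features from the URL string alone."""
    # Add a scheme when it is missing so that urlparse can find the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    # Part after the scheme, so the "//" in "http://" is not counted.
    rest = url.split("://", 1)[-1]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_delimiters": sum(ch in DELIMITERS for ch in rest),
        "has_query": int("?" in url),
        "has_double_slash": int("//" in rest),
        "host_is_ip": int(host_is_ip(host)),
    }


if __name__ == "__main__":
    # Hypothetical example URL, not taken from the dataset.
    print(lexical_features("http://192.168.0.1/login.php?user=a"))
```

Each URL becomes a fixed-length numeric vector, which is the form that classifiers such as KNN, ABC and DT expect as input.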