Report on Phishing Website Detection using Machine Learning Techniques


Added on  2022/08/24

Report
AI Summary
This report analyzes a machine learning approach to phishing website detection. The report discusses the use of Byte Pair Encoding (BPE) to generate tokens from HTML code, followed by TF-IDF weighting. It then explains the use of a Random Forest classifier to generate phishing website detection results. The report highlights the creation of a confusion matrix (containing true positive and false positive rates) and the effect of confidence threshold on test accuracy to determine whether a website is legitimate or a phishing site. The report acknowledges the effectiveness of the described process but also notes potential limitations, such as the large number of phishing websites and the possibility of attackers bypassing the system by altering the HTML codes.
Running head: PHISHING WEBSITE DETECTION USING MACHINE LEARNING
PHISHING WEBSITE DETECTION USING MACHINE LEARNING
Name of the Student
Name of the University
Author Note
Introduction
Phishing activity on the web has risen recently, and many people have lost vital data, and
with it large sums of money, after visiting a fraudulent site. For this reason, the ‘Phishing
Website Detection with Machine Learning’ post was chosen, to gain a better understanding of
how phishing can be detected using machine learning.
Discussion
The ‘Phishing Website Detection with Machine Learning’ blog post was chosen because it
describes its process simply. Beyond its simplicity, the method it describes for building the
machine learning application also compares well with other posts available on the internet.
With the rise of phishing activity, many people enter confidential information believing that
they are on a genuine website, and in most cases they end up losing a lot of money from their
accounts. There is therefore a need to separate genuine websites from the pool of spoofed
ones. Once genuine and spoofed websites can be told apart, this problem can be avoided, and
the selected post does so simply, avoiding complicated processes as far as possible.
The selected post uses raw HTML code to detect whether a website is spoofed, because an
attacker can easily use SSL to make the website URL look genuine. However, it is hard to
obfuscate a site's code so that a system cannot inspect its HTML. As per the post, Byte Pair
Encoding (BPE) is first used to generate tokens from a site's HTML code (a tokenizer for
this can be found in the GitHub repository, with instructions). TF-IDF (term frequency,
inverse document frequency) weights are then computed for each BPE token. For the machine
learning classifier, a Random Forest (RF) classifier with default hyperparameters is used to
generate the phishing website detection results. Finally, a confusion matrix is created
(containing true positive and false positive rates), and the effect of the confidence
threshold on test accuracy is examined to decide whether a site is spoofed or legitimate
(Phishytics – Machine Learning for Detecting Phishing Websites - KDnuggets, 2020).
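The steps above can be illustrated with a minimal sketch in plain Python. It is not the
post's implementation: the BPE learner is a simplified, character-level version written for
illustration, the function names are hypothetical, and the Random Forest step is left out, so
the probabilities passed to confusion_at_threshold stand in for a classifier's confidence
scores.

```python
import math
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of HTML strings."""
    docs = [list(text) for text in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in docs:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:  # merging a pair seen only once gains nothing
            break
        merges.append(best)
        docs = [apply_merge(symbols, best) for symbols in docs]
    return merges


def tokenize(text, merges):
    """Tokenize one HTML string by replaying the learned merges in order."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def tfidf(token_docs):
    """Return one {token: tf * idf} dict per document, with idf = ln(N / df)."""
    n = len(token_docs)
    df = Counter()
    for tokens in token_docs:
        df.update(set(tokens))
    weights = []
    for tokens in token_docs:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def confusion_at_threshold(probs, labels, threshold):
    """Count (tp, fp, tn, fn), predicting 'phishing' when prob >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


if __name__ == "__main__":
    # Made-up HTML snippets, for illustration only.
    pages = [
        "<form action='login.php'><input type='password'></form>",
        "<p>Welcome to our blog</p><a href='/about'>About</a>",
    ]
    merges = learn_bpe_merges(pages, 20)
    weights = tfidf([tokenize(page, merges) for page in pages])
    print(len(merges), "merges learned;", len(weights[0]), "weighted tokens in page 0")
```

Raising the threshold in confusion_at_threshold trades false positives for false negatives,
which is the effect of the confidence threshold that the post examines. In a fuller version,
the TF-IDF weights would be the input features of the classifier.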
Conclusion
The process described in the post detects phishing websites proficiently. However, there
are millions of phishing websites, and a model built by following the post with a dataset of
around 15,000 samples may have trouble detecting newly created ones. Moreover, since
detection relies solely on raw HTML code, an attacker could observe the model's predictions
and alter a site's HTML to bypass detection, making the implemented model ineffective.
Reference
KDnuggets. 2020. Phishytics – Machine Learning for Detecting Phishing Websites - KDnuggets.
[online] Available at: <https://www.kdnuggets.com/2020/03/phishytics-machine-learning-detecting-phishing-websites.html>
[Accessed 18 March 2020].