Report on Phishing Website Detection using Machine Learning Techniques

Added on 2022/08/24

AI Summary

This report analyzes a machine learning approach to phishing website detection. The report discusses the use of Byte Pair Encoding (BPE) to generate tokens from HTML code, followed by TF-IDF weighting. It then explains the use of a Random Forest classifier to generate phishing website detection results. The report highlights the creation of a confusion matrix (containing true positive and false positive rates) and the effect of confidence threshold on test accuracy to determine whether a website is legitimate or a phishing site. The report acknowledges the effectiveness of the described process but also notes potential limitations, such as the large number of phishing websites and the possibility of attackers bypassing the system by altering the HTML codes.

Running head: PHISHING WEBSITE DETECTION USING MACHINE LEARNING
PHISHING WEBSITE DETECTION USING MACHINE LEARNING
Name of the Student
Name of the University
Author Note

Introduction
Phishing activity has risen recently across many websites. As a result, many people
have lost vital data, and often large sums of money, after visiting an affected site.
For this reason, the blog post ‘Phishing Website Detection with Machine Learning’ has
been chosen to gain a better understanding of how phishing can be detected using
machine learning.
Discussion
The blog post ‘Phishing Website Detection with Machine Learning’ was chosen because
it describes its process simply. Beyond that simplicity, the method it describes for
applying machine learning is also more useful than that of many other posts available
on the internet. With the rise of phishing on the internet, many people enter
confidential information believing that they are on a genuine website, and in most
cases they end up losing money from their accounts. There is therefore a need to
separate genuine websites from the many spoofed ones. Identifying the spoofed
websites allows this problem to be avoided, and the selected post does so simply,
avoiding complicated processes as far as possible.
The selected post uses raw HTML code to detect whether a website is spoofed, because
an attacker can easily use SSL to make a website's URL look genuine. However, it is
hard to obfuscate a site's code well enough to prevent a system from reading its HTML.
As per the post, Byte Pair Encoding (BPE) is first used to generate tokens from a
site's HTML code (a tokenizer for this purpose, with instructions, can be found in the
GitHub repository). TF-IDF (term frequency, inverse document frequency) weights are
then computed for each BPE token. For the machine learning classifier, an RF (Random
Forest) classifier with default hyperparameters is used to generate the phishing
website detection results. Finally, a confusion matrix is created (containing true
positive and false positive rates), and the effect of the confidence threshold on test
accuracy is examined to decide whether a site is spoofed or legitimate (Phishytics –
Machine Learning for Detecting Phishing Websites - KDnuggets, 2020).
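The steps described above can be illustrated with a minimal Python sketch. It is not
the post's implementation: the BPE here is a toy character-level version written from
scratch rather than the tokenizer from the GitHub repository, the TF-IDF formula is the
plain tf * log(N / df) variant, and the Random Forest itself is left out, so the
probabilities passed to rates_at_threshold are hypothetical stand-ins for a
classifier's output. All function names and the sample pages are assumptions made for
illustration.

```python
import math
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each occurrence of the adjacent pair with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(texts, num_merges):
    """Learn BPE merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(t) for t in texts]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def tokenize(text, merges):
    """Tokenize text by replaying the learned merges in order."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def tfidf(token_lists):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    weights = []
    for toks in token_lists:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def rates_at_threshold(probs, labels, threshold):
    """True and false positive rates when 'phishing' means prob >= threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


# Made-up HTML snippets standing in for crawled pages.
pages = [
    '<form action="login.php"><input type="password"></form>',
    '<p>Welcome to our news site</p><a href="/about">About</a>',
    '<form action="verify.php"><input type="password"></form>',
]
merges = learn_bpe_merges(pages, num_merges=30)
token_lists = [tokenize(p, merges) for p in pages]
weights = tfidf(token_lists)  # these vectors would feed the classifier

# Hypothetical phishing probabilities (1 = phishing, 0 = legitimate).
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 1, 0]
for threshold in (0.3, 0.5, 0.7):
    print(threshold, rates_at_threshold(probs, labels, threshold))
```

Raising the threshold in this toy data lowers the false positive rate first and then the
true positive rate, which is the kind of trade-off the post examines when it varies the
confidence threshold.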
Conclusion
The process described in this post detects phishing websites proficiently. However,
there are millions of phishing websites, and a model trained as in the post, on a
dataset of around 15,000 samples, might have difficulty detecting newly created
phishing websites. Moreover, since detection relies solely on raw HTML code, an
attacker could observe the model's predictions and alter the HTML to evade detection,
making the implemented model ineffective.

Reference
KDnuggets. 2020. Phishytics – Machine Learning For Detecting Phishing Websites -
KDnuggets. [online] Available at:
<https://www.kdnuggets.com/2020/03/phishytics-machine-learning-detecting-phishing-websites.html>
[Accessed 18 March 2020].