Data Analytics for Cybersecurity

   

Added on  2022-11-29

ANALYTICS 1
DATA ANALYTICS FOR CYBERSECURITY
NAME OF STUDENT
NAME OF PROFESSOR
NAME OF CLASS
NAME OF SCHOOL
STATE AND CITY OF SCHOOL
DATE

Table of Contents
Executive Summary
Introduction
Literature Review
Technical Demonstration
   Decision Tree
   Logistic Regression
   Random Forest
   Naïve Bayes
   K-Means
Performance Evaluation
Conclusion
References

Executive Summary
The study found that the provided datasets could be converted from text files into
CSV files for analysis. The report has around eight sections, which offer different
views on the analysis of the dataset and on the classification of the response
variable. The topic throughout is the application of data mining and machine
learning to text that originates from tweets. On any social media platform used for
communication, there is a need to sift through the messages sent from one
individual to another. The aim is to protect the innocent users of such platforms
from users who are less innocent. This measure is discussed in depth in the
literature review section. Several classification algorithms are used, and their
results are presented in the performance evaluation chapter. A technical part shows
the lines of code used for classification.
After this executive summary, which outlines the findings of the machine learning
project undertaken in this study, there is an introduction followed by the
literature review. The two are linked: the introduction lists the algorithms, and
the literature review expands on their areas of application. After the literature
review comes a technical demonstration of the code, followed by the performance
evaluation and finally the conclusion.

Introduction
This study uses five classification algorithms in R to classify the tweet datasets
that have been provided. The performance of all the algorithms is compared in the
performance evaluation section. The purpose of classifying the tweets is to
determine which are spammer tweets and which are non-spammer tweets. This is
similar to customer churn, where customers who are expected to leave a service
provider are distinguished from those who are expected to stay. Another closely
related case is electronic mail. Just as with tweets, spam emails are detected by
the algorithms of the mailing platform in question, be it Hotmail, Gmail or Samsung
Mail. Mails that are not sent with good intentions are considered suspicious and,
once detected, can be classified as spam and moved to the spam folder. As with
mail, there are only two classes of tweets: spammer and non-spammer. To classify
them, the study uses a decision tree, logistic regression, Naïve Bayes, Random
Forest and the K-Means model. The decision tree can be confusing, as there are two
types. The classification decision tree predicts the class of a categorical
variable; it is the one used here, because the response variable, the tweet class,
is categorical and stored as a factor. The other type, the regression decision
tree, is used to predict a numeric variable.
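The difference between the two tree types can be made concrete through the score
each one tries to minimise when choosing a split. The sketch below is written in
Python rather than the R used in this report, and it assumes the common CART
criteria (Gini impurity for classification, variance for regression); the report
does not state which criteria its R code relies on. The helper names and the toy
labels are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (classification-tree criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance of numeric targets (regression-tree criterion)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def weighted_split_score(left, right, criterion):
    """Size-weighted average of the criterion over the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)


# Toy labels, made up for illustration: a split that separates the classes
# perfectly scores 0, while a split that leaves both children mixed scores higher.
pure_split = weighted_split_score(
    ["spammer", "spammer"], ["non-spammer", "non-spammer"], gini_impurity
)
mixed_split = weighted_split_score(
    ["spammer", "non-spammer"], ["spammer", "non-spammer"], gini_impurity
)
```

A classification tree would prefer the first split because its score is lower. A
regression tree applies the same weighting, but with variance as the criterion.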
Throughout the process, the response variable is tweet_class. The feature
no_tweetfavorites has to be dropped because it holds the same single value in every
cell. Its standard deviation is therefore zero, so it carries no information and
adds no value to the performance of the classification algorithms. All the
remaining variables can be used as classification features. The study finds that
the performance of the classification algorithms differs considerably, and this
should be taken into account when their performance is compared.
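The check that leads to dropping a constant feature can be sketched as follows. The
sketch is in Python rather than the R used in this report. Only the names
tweet_class and no_tweetfavorites come from the dataset; the column feature_a, the
row values and the helper functions are hypothetical and serve only to show the
idea.

```python
from statistics import pstdev

# Made-up rows for illustration; these are not taken from the actual dataset.
rows = [
    {"no_tweetfavorites": 0, "feature_a": 120, "tweet_class": "spammer"},
    {"no_tweetfavorites": 0, "feature_a": 15, "tweet_class": "non-spammer"},
    {"no_tweetfavorites": 0, "feature_a": 300, "tweet_class": "spammer"},
]


def zero_variance_features(rows, response):
    """Names of numeric features whose standard deviation across rows is zero."""
    features = [name for name in rows[0] if name != response]
    return [
        name for name in features
        if pstdev([row[name] for row in rows]) == 0
    ]


def drop_features(rows, names):
    """Copy of the rows without the listed features."""
    return [
        {key: value for key, value in row.items() if key not in names}
        for row in rows
    ]


constant = zero_variance_features(rows, response="tweet_class")
cleaned = drop_features(rows, constant)
```

Here `constant` contains only no_tweetfavorites, and `cleaned` keeps the remaining
feature together with the response variable.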
As outlined in the executive summary, the literature review follows the
introduction, followed by the technical demonstration of the code and then the
performance evaluation. The performance metrics considered are the ROC curve and
the AUC, the confusion matrix, the true positive rate, the false positive rate,
precision and recall, among other measures.
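The measures listed above all derive from the same counts, as the following sketch
shows. It is written in Python rather than the R used in this report, the scores
and the 0.5 threshold are made up for illustration, and the AUC is computed through
its pairwise-ranking interpretation rather than by integrating a plotted ROC curve.
The function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """True positive rate (recall), false positive rate and precision.

    Assumes both classes occur and at least one positive prediction is made,
    so that no denominator is zero.
    """
    return {
        "tpr": tp / (tp + fn),        # also called recall
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


def auc(actual, scores, positive):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores, for illustration only.
actual = ["spammer", "spammer", "spammer",
          "non-spammer", "non-spammer", "non-spammer"]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
predicted = ["spammer" if s >= 0.5 else "non-spammer" for s in scores]

counts = confusion_counts(actual, predicted, positive="spammer")
metrics = rates(*counts)
area = auc(actual, scores, positive="spammer")
```

With these toy values the confusion counts are 2 true positives, 1 false positive,
1 false negative and 2 true negatives. Moving the threshold changes the true and
false positive rates together, which is what the ROC curve traces out.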
Literature Review
The project is based entirely on machine learning and, given the dataset, on
classification algorithms. Classification algorithms are supervised algorithms;
the other families are unsupervised, reinforcement and semi-supervised algorithms.
In supervised learning, models are trained on examples whose classes are already
known and are set to operate in a specific way. The fitted models are monitored
from time to time to check their results. The name supervised machine learning
reflects this setup: the known labels guide the training, and the final model is
kept under watch. The results obtained prompt frequent changes to the models, as
they need improving whenever the results being sought are not achieved.
As mentioned in an earlier section, customer churn is the best example of an
application, both in business and in a learning setup. In the customer churn
scenario, customers are of key interest, and the organizations involved seek to
find the category into which each customer falls. The need arises because some
customers are loyal and others are not, for various reasons that can make a
customer shift from one service provider to another. Organizations mostly seek to
keep the customers they have already acquired, as it is more expensive to lose them
and acquire new ones than to maintain existing ones. Machine learning algorithms
can therefore be used to classify which customers are loyal and which are not.
From there, plans can be executed to retain customers who would otherwise be lost
for reasons that the company in question had overlooked.
Machine learning algorithms are preferable to human readers for an evident reason:
human readers cannot work as fast or as tirelessly as machines that have been
trained to mimic the operational