MN623: Data Analytics for Network Intrusion Detection using Weka

Added on 2025/04/10

MN623
Cyber Security and Analytics
Assignment 2
Data analytics for intrusion detection
Student Name:
Student ID:

Contents
Introduction
Section 1: Data Analytic tools and techniques
Section 2 Data Analytics for Network Intrusion Detection
Conclusion
List of Figures
Figure 1: Installation Window 1
Figure 2: Installation Window 2
Figure 3: Installation Window 3
Figure 4: Installation Window 4
Figure 5: Lab Experiment 1
Figure 6: Lab Experiment 2
Figure 7: Clustering of data
Figure 8: Decision Tree Regression(j48)
Figure 9: Classification of data into training and testing data
Figure 10: Confusion Matrix for decision tree

Introduction
This report uses network intrusion detection data taken from the KDD99 dataset. Two data analytics
techniques are applied to it: clustering and decision trees. The analysis is carried out in Weka, a
Java-based tool that runs on any operating system with a Java Runtime Environment (JRE). Weka
provides a graphical user interface, so we can build, train and test a model without writing code.

Section 1: Data Analytic tools and techniques.
1.
Weka was installed as the data analytics tool. Weka is built on Java and can be installed on any
operating system that has Java installed. Screenshots of the installation windows are included
below. Weka can be downloaded from sourceforge.net [1].
Figure 1: Installation Window 1
Figure 2: Installation Window 2
Figure 3: Installation Window 3
Figure 4: Installation Window 4
2.
2.1 Clustering:
Clustering is a data mining technique in which data elements are grouped according to their
similarities and differences: elements in the same cluster are more similar to each other than to
elements in other clusters. For algorithms such as k-means, the number of groups is chosen in
advance and determines how many clusters the data are divided into. Pattern recognition, data
analysis, market research and image processing applications are based on cluster analysis [2].
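The grouping idea can be sketched in a few lines of Python. This is a minimal illustration of k-means, not Weka's implementation: the two-dimensional sample points, the starting centroids and the function name kmeans are all assumptions made for the example.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the grouping is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The number of starting centroids fixes the number of clusters, which is why the number of groups has to be decided before the algorithm runs.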
2.2 Decision Tree Regression:
Decision tree regression is used in business as a predictive analysis technique. The predicted
output is continuous, not discrete. Regression analysis is well suited to prediction; there are
various types of regression technique, such as linear regression, logistic regression and
non-linear regression [3].
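To show what a continuous prediction from a tree looks like, the sketch below fits the smallest possible regression tree: a single split (a "stump") chosen to minimise the squared error, with the mean of each side as the prediction. The data and the function names are hypothetical, and the sketch assumes the inputs contain at least two distinct x values.

```python
def fit_regression_stump(xs, ys):
    """Find the single split on x that minimises the sum of squared errors.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Return the mean of the side of the split that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical data: low x values have small targets, high x values large ones.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_regression_stump(xs, ys)
print(stump, predict_stump(stump, 2.5), predict_stump(stump, 11))
```

A full regression tree repeats the same split search inside each side until a stopping rule is met; the leaves still return numeric means, not class labels.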
3.
In this demonstration the iris data set, which has four attributes, is used. Both classification
and decision tree techniques are applied, and a confusion matrix is generated.

Figure 5: Lab Experiment 1
Figure 6: Lab Experiment 2

Section 2 Data Analytics for Network Intrusion Detection
1.
Available data formats for data analytics are:
• Text Files
• Sequence file
• Avro Data Files
• Parquet File Format
Now, let's discuss these in detail:
Text Files: A text file contains only plain text and is usually saved with the .txt extension. It
holds characters without special formatting such as bold, italic or underline. It can be opened in
any text editor and printed as a hard copy [4].
Sequence file: A sequence file is a flat file that stores data in binary form as key-value pairs.
It is used as an input and output format in MapReduce, which is used in data science. The file
header also holds metadata about the file [5].
Avro Data Files: Avro is used in Apache Hadoop as a remote procedure call and data serialization
framework. The schema is defined in JSON, and the data are serialized in a compact binary format.
In MapReduce jobs, Avro data can also be handled as key-value pairs [6].
Parquet File Format: Parquet is mainly used in Hadoop. It stores data in a flat columnar format and
is available to any project in the Hadoop ecosystem, such as Hive, Impala, Pig and Spark. Because
the file is column-oriented, the values of each column are stored adjacently, which gives better
compression [7].
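The compression benefit of a column-oriented layout can be illustrated with a toy run-length encoding. This sketch does not read or write Parquet files; the three-field records and the function name are invented for the example, and real columnar formats use more elaborate encodings.

```python
from itertools import groupby

# Hypothetical connection records: (protocol, service, label).
rows = [
    ("tcp", "http", "normal"),
    ("tcp", "http", "normal"),
    ("tcp", "http", "attack"),
    ("udp", "dns", "normal"),
    ("udp", "dns", "normal"),
]


def run_length_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented layout: all fields of record 1, then all fields of record 2, ...
row_layout = [field for row in rows for field in row]
# Column-oriented layout: every protocol, then every service, then every label.
column_layout = [row[i] for i in range(3) for row in rows]

rle_rows = run_length_encode(row_layout)
rle_columns = run_length_encode(column_layout)
print(len(rle_rows), len(rle_columns))
```

In the row layout neighbouring values come from different fields and rarely repeat, so nothing collapses. In the column layout equal values sit next to each other and form long runs, which is the property a columnar format exploits.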
2.
The rationale performs two principal functions in the line of reasoning. It does not require any
creativity, but its parts should tie together. It should contain the keywords, phrases and
meaningful words of the problem. The logic gives shape to your thoughts and helps you plan how to
solve the problem through comprehensive research. Always keep the audience in mind and consider
whether they would accept it [8].
3.
Training and testing data were created in Weka. In the Preprocess tab we can pre-process the data
according to our requirements and select the attributes needed for analysis. After pre-processing,
we move to the Classify tab, where the data can be split into a training set and a testing set.
50% of the data is used as the training set and 50% as the testing set.
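The 50/50 partition can be sketched as follows. This is an assumption-laden illustration of a percentage split, not Weka's own code: the function name, the seed and the rounding rule are choices made for the example, and Weka may shuffle and round differently.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and cut them into (training set, testing set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of ten instances, identified here by number only.
instances = list(range(10))
train, test = percentage_split(instances, 50)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is ordered, for example with all records of one attack type stored together; without it one half could contain classes the other half never sees.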
4.
Two data analytical techniques are used:
i. Clustering:
In the clustering technique, the attribute named label has twenty-three distinct values. The
clustering is done through the Cluster tab. The data are partitioned 50% into a training set and
50% into a testing set. The KDD99 data set contains intrusion detection system data and has been
used in many data analysis hackathons. The k-means clustering technique is used, with
twenty-three clusters.
Figure 7: Clustering of data
ii. Decision Tree Regression:
In this technique the data are evaluated in a tree fashion, and the output is discrete. The
decision tree is built in the Classify tab by choosing J48 in the trees section (J48 is Weka's
implementation of the C4.5 decision tree algorithm). We can then select the percentage of the data
to give to the training set and to the testing set.
Figure 8: Decision Tree Regression(j48)
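How a decision tree decides which attribute to test first can be sketched with entropy and information gain. The tiny data set and attribute names below are hypothetical, and the sketch stops at plain information gain; C4.5, which J48 implements, goes one step further and normalises it into a gain ratio.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on an attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical connection records with two attributes and a class label.
protocol = ["tcp", "tcp", "udp", "udp"]
flag = ["SF", "REJ", "SF", "REJ"]
label = ["normal", "attack", "normal", "attack"]

gain_protocol = information_gain(protocol, label)
gain_flag = information_gain(flag, label)
print(gain_protocol, gain_flag)
```

Here flag separates the two classes perfectly while protocol tells us nothing, so a tree built on this data would test flag at the root.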
5.
The network intrusion data were classified using the Classify tab. The data set contains a large
number of instances, and the data were classified into various classes [9].
Figure 9: Classification of data into training and testing data
6.
A confusion matrix is generated for the decision tree, as shown below:
Figure 10: Confusion Matrix for decision tree
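For readers unfamiliar with the layout of a confusion matrix, the sketch below builds one from a list of actual and predicted labels. The labels and predictions are made up for illustration and are not the values from the figure.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual class, columns = predicted class."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Hypothetical test-set results for a two-class problem.
actual = ["normal", "normal", "attack", "attack", "attack", "normal"]
predicted = ["normal", "attack", "attack", "attack", "normal", "normal"]

labels, matrix = confusion_matrix(actual, predicted)
# Correct predictions lie on the diagonal of the matrix.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(labels)
print(matrix)
print(accuracy)
```

Each off-diagonal cell counts one kind of mistake, for example attacks that were predicted as normal traffic, which is usually the costliest error in intrusion detection.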
7.
Overfitting occurs when a model fits its training data too closely, capturing noise and details
that are particular to that data set as well as the underlying pattern. Such a model predicts the
training data well but performs poorly on unseen data, so it is of little help in predicting
future observations. Underfitting is the opposite problem: the model is too simple to capture the
structure of the data, so it performs poorly even on the training data, and such models are said
to be biased. Overfitting is most likely when there is little theory to guide the analysis and a
large number of models to select from, and when the model has many parameters relative to the
number of observations. For this reason the suitability of a model should be judged on data that
were not used to fit it; its fit on new data is usually worse than on the original data, an effect
known as shrinkage. The two phenomena are also called overtraining and undertraining. An overfitted
model does not really learn anything; it only memorizes the data.
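The difference between memorizing and learning can be shown with a deliberately extreme example. Everything here is hypothetical: the single feature (connections per second), the hand-picked threshold of 50 and the data values, one of which is intentionally mislabelled to act as noise.

```python
def train_memorizer(examples):
    """'Learn' by storing every training example verbatim."""
    table = dict(examples)
    # Inputs never seen in training fall back to a fixed default answer.
    return lambda x: table.get(x, "normal")


def threshold_rule(x):
    """A simple model: many connections per second suggests an attack."""
    return "attack" if x >= 50 else "normal"


def accuracy(model, examples):
    """Fraction of examples for which the model returns the recorded label."""
    return sum(model(x) == y for x, y in examples) / len(examples)


# Training data; the last record is noise (a busy but harmless connection).
train = [(5, "normal"), (12, "normal"), (60, "attack"), (75, "attack"), (55, "normal")]
test = [(8, "normal"), (58, "attack"), (90, "attack"), (20, "normal")]

memorizer = train_memorizer(train)
print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(threshold_rule, train), accuracy(threshold_rule, test))
```

The memorizer scores perfectly on the training set, noise included, and then fails on the unseen attacks in the test set. The simpler rule gets the noisy training record wrong but generalises to the test set.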
8.
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new
data points by taking a weighted vote of their predictions.
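The weighted combination of classifier outputs can be sketched as below. The three predictions and their weights are invented for the example; in practice the weights would come from how well each classifier performed, for instance on a validation set.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across classifiers."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one connection record.
predictions = ["attack", "normal", "normal"]
weights = [0.9, 0.4, 0.3]  # assumed reliability of each classifier

print(weighted_vote(predictions, weights))
print(weighted_vote(predictions, [1, 1, 1]))
```

With equal weights the two "normal" votes win by simple majority, but once the first classifier is trusted more, its single "attack" vote outweighs them.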
9.
In network intrusion detection we can use Kramer's algorithm, in which the data used are the types
of attack and the software through which the attack is made. With it we can detect and predict
which software is likely to have been used to carry out an attack, which would be helpful.
10.
In future, these concepts can be extended to the many applications where data analysis plays an
important role, such as market research, pattern recognition and business pattern analysis. Data
analytics will be helpful in many industries.

Conclusion
In this report, the given data were analysed using Weka, a Java-based application. Clustering and
decision tree regression were used. Overfitting and underfitting of models were discussed.
Different file formats were described, for example sequence files, text files, Avro files and
Parquet files. A logical rationale helps in working through the problem.

References:
[1] "Weka 3 - Data Mining with Open Source Machine Learning Software in Java", Cs.waikato.ac.nz, 2019. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 25-Jan-2019].
[2] "K-Means Clustering in WEKA", Facweb.cs.depaul.edu, 2019. [Online]. Available: http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/k-means.html. [Accessed: 25-Jan-2019].
[3] L. Giura, "Decision trees explained using Weka | technobium", technobium, 2019. [Online]. Available: http://technobium.com/decision-trees-explained-using-weka/. [Accessed: 25-Jan-2019].
[4] "Fillable Online WEKA Experimenter Tutorial for Version 3-5-6 David Scuse Peter Reutemann June 1, 2007 c 2002-2007 David Scuse and University of Waikato Contents 1 Introduction 2 2 Standard Experiments 2 Fax Email Print - PDFfiller", Pdffiller.com, 2019. [Online]. Available: https://www.pdffiller.com/58434973--WEKA-Experimenter-Tutorial-for-Version-3-5-6-David-Scuse-Peter-Reutemann-June-1-2007-c-2002-2007-David-Scuse-and-University-of-Waikato-Contents-1-Introduction-2-2-Standard-Experiments-2-. [Accessed: 25-Jan-2019].
[5] 2019. [Online]. Available: https://community.hortonworks.com/questions/72576/what-is-the-exact-difference-between-sequence-file.html. [Accessed: 25-Jan-2019].
[6] "HistoryofInformation.com", Historyofinformation.com, 2019. [Online]. Available: http://www.historyofinformation.com/expanded.php?cat=59&era=0. [Accessed: 25-Jan-2019].
[7] "Introduction To Parquet File Format with a Parquet Format Example", AcadGild, 2019. [Online]. Available: https://acadgild.com/blog/parquet-file-format-hadoop. [Accessed: 25-Jan-2019].
[8] "Select the features with rationale. - Google Search", Google.com, 2019. [Online]. Available: https://www.google.com/search?client=ubuntu&channel=fs&q=Select+the+features+with+rationale.&ie=utf-8&oe=utf-8. [Accessed: 25-Jan-2019].
[9] J. Brownlee, "How to Run Your First Classifier in Weka", Machine Learning Mastery, 2019. [Online]. Available: https://machinelearningmastery.com/how-to-run-your-first-classifier-in-weka/. [Accessed: 25-Jan-2019].