Data Analytics Report on Network Intrusion Detection using Weka Tools

Added on 2023/03/31

AI Summary

This report presents an analysis of network intrusion detection using data analytics techniques. The project utilizes the NSL-KDD dataset and the Weka tool to evaluate various classification methods, including decision trees and clustering, for identifying and mitigating network attacks. The report details the implementation of these techniques, feature selection processes, and the creation of testing and training samples. Performance evaluation is conducted using metrics like the confusion matrix, and the limitations of overfitting are discussed. The study also explores ensemble tools and offers recommendations for future work, emphasizing the potential for improved network security through advanced data analysis methods. The report investigates Weka analytics in three stages on network intrusion detection methods and also investigates the network intrusion of cyber security and different comparative analysis, including Weka data analytics.

Data Analytics
Name of the Student:
Register Number:

Table of Contents
Description of the Project
Description of Dataset
Section 1 - Data Analytic Tools and Techniques
Section 2 - Data Analytic for Network Intrusion Detection
1. Bench Mark Data
2. Select Features
3. Create Testing and Training Samples
4. Data analytic Techniques
5. Network Intrusion
6. Performance of Intrusion Detection - Confusion Matrix
7. Limitation of over fitting
8. Ensemble Tools and its Uses
9. Recommendation
10. Future Work
References

Description of the Project
This project aims to implement intrusion detection on a public network dataset, evaluated
with Weka. It requires analyzing the NSL-KDD dataset to identify the attacks it contains, which
fall into categories such as user-to-root, remote-to-local, and probing. Next, the performance of
various classification methods is checked on anomalous network traffic with the help of Weka's
data analytics techniques. For the relationship analysis, the intrusion records in the dataset
provide the anomalous network traffic. The analysis is completed with the data processing tools
available in Weka. This helps to expose facts about the relationship between network attacks and
the algorithms used to detect them. This report investigates Weka analytics in three stages on
network intrusion detection methods. Further, it investigates network intrusion in cyber security
and different comparative analyses, including Weka data analytics.
Description of Dataset
The NSL-KDD dataset used for network intrusion detection in Weka is distributed as 8
files. The NSL-KDD dataset was suggested to solve some of the inherent issues of the earlier
KDD'99 dataset. It is used to compare intrusion detection methods in terms of how accurately
they detect anomalous network traffic [1], which is the main target when developing machine
learning algorithms and intrusion detection systems. The NSL-KDD dataset files used with the
Weka mining tools are the following:
1) KDDTrain+.ARFF:
The full NSL-KDD training set with binary labels in ARFF format, for comparative
analysis in the Weka mining tools.
2) KDDTrain+.TXT:
The full NSL-KDD training set in CSV format, including the attack type labels and
difficulty level.
3) KDDTrain+_20Percent.ARFF
A 20 percent subset of the KDDTrain+.ARFF file.
4) KDDTrain+_20Percent.TXT
A 20 percent subset of the KDDTrain+.TXT file.
5) KDDTest+.ARFF
The full NSL-KDD test set with binary labels in ARFF format.
6) KDDTest+.TXT
The full NSL-KDD test set in CSV format, including the attack type labels and
difficulty level.
7) KDDTest-21.ARFF
A subset of the KDDTest+.ARFF file that does not contain the records with a difficulty
level of 21 out of 21.
8) KDDTest-21.TXT
A subset of the KDDTest+.TXT file that does not contain the records with a difficulty
level of 21 out of 21.
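The difference between the attack type labels of the TXT files and the binary labels of the ARFF files can be illustrated with a short Python sketch. It assumes that the last two comma-separated fields of a TXT record are the attack type label and the difficulty level. The two sample records are hypothetical and shortened to four feature values for readability (a full record carries 41), and the helper names are chosen only for this example.

```python
def split_record(line):
    """Split one TXT-style record into (features, attack_label, difficulty)."""
    fields = line.strip().split(",")
    # Assumption: the last two fields are the attack type and the difficulty level.
    return fields[:-2], fields[-2], int(fields[-1])


def to_binary_label(attack_label):
    """Collapse an attack type label into the binary normal/anomaly label."""
    return "normal" if attack_label == "normal" else "anomaly"


# Hypothetical, shortened records (not copied from the dataset).
sample_lines = ["0,tcp,http,SF,normal,21", "0,tcp,private,S0,neptune,19"]
parsed = [split_record(line) for line in sample_lines]
binary_labels = [to_binary_label(label) for _, label, _ in parsed]
print(binary_labels)
```

Any attack type other than normal maps to anomaly, which is the two-class view used in the Weka outputs of this report.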
Section 1 - Data Analytic Tools and Techniques
This section covers the installation of the data analytics platform and a demonstration of
at least two data analysis techniques, as follows [1]:
• Choose the dataset for the data analysis platform.
• Choose the techniques, such as data mining and clustering.
• In the Weka tool, choose the Knowledge Flow for the analysis dataset.
Begin the process by opening Weka.
The following figure shows the step of uploading the NSL-KDD dataset, where the Explorer
is selected.
After uploading the dataset, open the Preprocess tab.
Then, open the file from the NSL-KDD data. The result looks as follows [2].
Network intrusion detection is carried out with two data analysis techniques:
1) Decision tree
2) Cluster mining.
To run the decision tree algorithm in Weka, go to the Classify tab and choose J48 under trees.
This step is illustrated as follows.
In the test options, choose to evaluate the J48 decision tree on the training data.
Number of Leaves: 615
Size of the tree: 714
Time taken to build model: 1.98 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.16 seconds
=== Summary ===
Correctly Classified Instances 22394 99.3346 %
Incorrectly Classified Instances 150 0.6654 %
Kappa statistic 0.9864
Mean absolute error 0.0105
Root mean squared error 0.0725
Relative absolute error 2.142 %
Root relative squared error 14.6356 %
Total Number of Instances 22544
=== Detailed Accuracy by Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.993    0.006    0.992      0.993   0.992      0.986  1.000     1.000     normal
               0.994    0.007    0.995      0.994   0.994      0.986  1.000     1.000     anomaly
Weighted Avg.  0.993    0.007    0.993      0.993   0.993      0.986  1.000     1.000
=== Confusion Matrix ===
    a     b   <-- classified as
 9643    68 | a = normal
   82 12751 | b = anomaly
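The per-class figures in the Detailed Accuracy table can be recomputed from the confusion matrix alone. The following Python sketch does this for the TP rate (recall), FP rate, precision, and F-measure. The function name and the dictionary layout are assumptions made for this example; only the four counts come from the Weka output.

```python
# Confusion matrix from the J48 output: outer key = actual class, inner key = predicted class.
matrix = {
    "normal": {"normal": 9643, "anomaly": 68},
    "anomaly": {"normal": 82, "anomaly": 12751},
}


def per_class_metrics(matrix, cls):
    """Return (TP rate, FP rate, precision, F-measure) for one class."""
    tp = matrix[cls][cls]
    fn = sum(n for pred, n in matrix[cls].items() if pred != cls)
    fp = sum(row[cls] for actual, row in matrix.items() if actual != cls)
    tn = sum(n for actual, row in matrix.items() if actual != cls
             for pred, n in row.items() if pred != cls)
    tp_rate = tp / (tp + fn)  # also called recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, fp_rate, precision, f_measure


for cls in matrix:
    print(cls, [round(value, 3) for value in per_class_metrics(matrix, cls)])
```

Rounded to three decimals, the values agree with the normal and anomaly rows of the table.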
The following figure shows the decision tree visualization.
To run the clustering technique, select the Cluster tab and choose the K-means clustering algorithm.
In the cluster mode, select the classes-to-clusters evaluation on the class attribute for the K-means clusters.
kMeans
======
Number of iterations: 17
Within cluster sum of squared errors: 53944.67210266422
Initial starting points (random):
Cluster 0:
8205,tcp,telnet,SF,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,255,10,0.04,0.85,0,0,0,0,0.83,0
Cluster 1:
0,tcp,imap4,RSTO,0,138,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,0.5,1,1,255,37,0.15,0.03,0,0,0,0,0.42,1
=== Model and evaluation on training set ===
Clustered Instances
0 15292 (68%)
1 7252 (32%)
Class attribute: class
Classes to Clusters:
    0     1  <-- assigned to cluster
 9477   234 | normal
 5815  7018 | anomaly
Cluster 0 <-- normal
Cluster 1 <-- anomaly
Incorrectly clustered instances: 6049.0 26.832 %
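The cluster sizes and the share of incorrectly clustered instances follow directly from the classes-to-clusters table. The sketch below recomputes them in Python. The variable names are assumptions for this example, and the counts and the cluster-to-class assignment are taken from the output above.

```python
# Classes-to-clusters counts: class -> [instances in cluster 0, instances in cluster 1].
counts = {"normal": [9477, 234], "anomaly": [5815, 7018]}
# Cluster-to-class assignment as reported in the output.
cluster_class = {0: "normal", 1: "anomaly"}

total = sum(sum(row) for row in counts.values())
cluster_sizes = [sum(row[c] for row in counts.values()) for c in cluster_class]
# An instance is incorrectly clustered when its cluster's class differs from its own class.
incorrect = sum(n for cls, row in counts.items()
                for c, n in enumerate(row) if cluster_class[c] != cls)
error_pct = 100 * incorrect / total
print(cluster_sizes, incorrect, round(error_pct, 3))
```

Most of the error comes from the 5815 anomaly instances that fall into cluster 0, the cluster assigned to normal traffic.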
Section 2 - Data Analytic for Network Intrusion Detection
1. Bench Mark Data
For intrusion detection, the NSL-KDD benchmark data is uploaded into Weka for
analysis.
2. Select Features
Go to the Select attributes tab and choose CfsSubsetEval as the attribute evaluator.
Then, choose BestFirst as the search method for intrusion detection [3].
The output of the feature selection is as follows.
Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 490
Merit of best subset found: 0.435
Attribute Subset Evaluator (supervised, Class (nominal): 42 class):
CFS Subset Evaluator
Including locally predictive attributes
Selected attributes: 5,6,12,25,28,30,31,37,41: 9
src_bytes
dst_bytes
logged_in
serror_rate
srv_rerror_rate
diff_srv_rate
srv_diff_host_rate
dst_host_srv_diff_host_rate
dst_host_srv_rerror_rate
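The "Merit of best subset found" value is the score that the CFS evaluator assigns to a subset of attributes. As a minimal sketch, and assuming the standard CFS heuristic (k times the mean attribute-class correlation, divided by the square root of k + k(k-1) times the mean attribute-attribute correlation), the score can be written in Python as below. The correlation values are hypothetical and are not taken from this report, so the printed numbers do not reproduce the merit of 0.435.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k attributes (standard heuristic, assumed here)."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Hypothetical correlations for a subset of 9 attributes.
uncorrelated = cfs_merit(9, 0.2, 0.0)  # attributes carry independent information
redundant = cfs_merit(9, 0.2, 0.5)     # attributes largely repeat each other
print(round(uncorrelated, 3), round(redundant, 3))
```

The heuristic rewards attributes that correlate with the class and penalizes attributes that correlate with each other, which is why a small subset such as the 9 attributes above can score better than the full attribute set.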
3. Create Testing and Training Samples
The following figure shows the testing and training samples for intrusion detection.
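One common way to create training and testing samples is a percentage split. The following Python sketch shows the idea on hypothetical records. The 66 percent training share, the seed, and the function name are assumptions for illustration and are not settings taken from this report.

```python
import random


def percentage_split(records, train_percent=66, seed=1):
    """Shuffle the records reproducibly and split them into training and testing samples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled records standing in for the dataset rows.
records = [("record-%d" % i, "normal" if i % 2 == 0 else "anomaly") for i in range(100)]
train, test = percentage_split(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two samples, so the testing sample contains only records the model has not seen during training.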
4. Data analytic Techniques
The following data analysis techniques are used:
1) Decision Tree techniques
The following output is obtained.
1) Correctly Classified Instances 22394 99.3346 %
2) Incorrectly Classified Instances 150 0.6654 %
3) Kappa statistic 0.9864
4) Mean absolute error 0.0105
5) Root mean squared error 0.0725
6) Relative absolute error 2.142 %
7) Root relative squared error 14.6356 %
8) Total Number of Instances 22544
2) Clustering technique
The following output is obtained.
kMeans
======
Number of iterations: 17
Within cluster sum of squared errors: 53944.67210266422
Initial starting points (random):
Cluster 0:
8205,tcp,telnet,SF,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,255,10,0.04,0.85,0,0,0,0,0.83,0
Cluster 1:
0,tcp,imap4,RSTO,0,138,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,0.5,1,1,255,37,0.15,0.03,0,0,0,0,0.42,1
=== Model and evaluation on training set ===
Clustered Instances
0 15292 (68%)
1 7252 (32%)
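The "Within cluster sum of squared errors" in the K-means output measures how tightly the instances sit around their cluster centers. The Python sketch below shows the calculation on four hypothetical two-dimensional points. Weka's reported value also depends on how it treats nominal attributes and scaling, so the figure of 53944.67 is not reproduced here.

```python
def centroid_of(points):
    """Return the mean point of a non-empty list of numeric points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def within_cluster_sse(points, assignments, centroids):
    """Sum the squared Euclidean distances of all points to their assigned centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        centroid = centroids[cluster]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Hypothetical points and cluster assignments, for illustration only.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
assignments = [0, 0, 1, 1]
centroids = [centroid_of(points[:2]), centroid_of(points[2:])]
print(within_cluster_sse(points, assignments, centroids))
```

K-means iterates until the assignments stop changing, and a lower sum indicates more compact clusters.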
5. Network Intrusion
Network intrusion is analyzed on the given sample data. The data contains two classes of
network traffic:
1) Normal traffic, labeled [a]
2) Anomalous (intrusion) traffic, labeled [b].
6. Performance of Intrusion Detection - Confusion Matrix
This part evaluates the intrusion detection performance of the two data analytic
techniques as follows:
Decision tree
=== Confusion Matrix ===
    a     b   <-- classified as
 9643    68 | a = normal
   82 12751 | b = anomaly
For the decision tree, the results are as follows:
• Correctly Classified Instances - 99.3346 %
• Incorrectly Classified Instances - 0.6654 %
• Total Number of Instances - 22544
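The summary percentages, and the Kappa statistic reported in the decision tree summary, can be checked against the confusion matrix. The Python sketch below is a minimal check: the variable names are assumptions for this example, and the kappa formula used is the usual (observed agreement - expected agreement) / (1 - expected agreement).

```python
# Confusion matrix of the decision tree: rows = actual class, columns = predicted class.
matrix = [[9643, 68], [82, 12751]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
observed = correct / total  # observed agreement, i.e. accuracy

row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
# Agreement expected by chance, from the row and column totals.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (observed - expected) / (1 - expected)

print(correct, total - correct, round(100 * observed, 4), round(kappa, 4))
```

The 150 incorrectly classified instances are the 68 normal records classified as anomaly plus the 82 anomaly records classified as normal.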
Clustering
The classes-to-clusters assignment is:
    0     1  <-- assigned to cluster
 9477   234 | normal
 5815  7018 | anomaly
Cluster 0 <-- normal
Cluster 1 <-- anomaly
Incorrectly clustered instances: 6049.0 26.832 %
For clustering, two clusters are created:
1) Cluster 0
2) Cluster 1
Here, cluster 0 corresponds to normal traffic and cluster 1 to anomalous traffic.
In total, 26.832% of the instances are incorrectly clustered.
7. Limitation of over fitting
Overfitting refers to a model that fits the training data too well. When such a model
is applied to a noisy sample, the detail learned from the training data has a negative effect on
its performance on the new data. For a decision tree, the issue can be addressed by pruning the
tree, which removes some of the detail it has picked up. The following list represents a range
of matching materials:
8. Ensemble Tools and its Uses
Algorithms of ensemble tools include:
i. Random Forest
ii. Bagging
iii. AdaBoost
iv. Stacking
v. Voting
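Of the ensemble methods listed above, voting is the simplest to illustrate. The following Python sketch combines the predictions of three classifiers by majority vote. The classifier names and their predictions are hypothetical and serve only to show the mechanism.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers for one record."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions of three classifiers for the same four records.
tree_preds = ["normal", "anomaly", "anomaly", "normal"]
bayes_preds = ["normal", "normal", "anomaly", "anomaly"]
knn_preds = ["anomaly", "anomaly", "anomaly", "normal"]

combined = [majority_vote(votes) for votes in zip(tree_preds, bayes_preds, knn_preds)]
print(combined)
```

With an odd number of classifiers and two classes there are no ties, and a single classifier's mistake on a record is outvoted when the other two agree.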
9. Recommendation
Overfitting refers to a model that fits the training data too well. When such a model is
applied to a noisy sample, the detail learned from the training data has a negative effect on its
performance on the new data. It is therefore recommended to prune the decision tree, which
removes some of the detail it has picked up.
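The gap between performance on training data and on new data can be shown with a deliberately extreme sketch: a "model" that simply memorizes its training records. It is a hypothetical illustration in Python and not the J48 model from this report. The records and function names are invented for the example.

```python
from collections import Counter


def fit_memorizer(records):
    """Memorize every training record and remember the majority label as a fallback."""
    table = dict(records)
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    return table, majority


def predict(model, features):
    table, majority = model
    # Unseen feature combinations fall back to the majority label.
    return table.get(features, majority)


def accuracy(model, records):
    hits = sum(predict(model, features) == label for features, label in records)
    return hits / len(records)


# Hypothetical (features, label) records, for illustration only.
train = [(("tcp", "http"), "normal"), (("tcp", "telnet"), "anomaly"), (("udp", "dns"), "normal")]
unseen = [(("tcp", "ftp"), "anomaly"), (("udp", "dns"), "normal")]

model = fit_memorizer(train)
print(accuracy(model, train), accuracy(model, unseen))
```

The memorizer is perfect on the data it was trained on and much weaker on records it has not seen, which is why accuracy measured on the training set alone can overstate how well a model detects new intrusions.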
10. Future Work
Future research can apply five simple classification techniques to the detection of
network intrusion on the given dataset, because classification is a well-known data analysis
technique that provides the desired results for the chosen dataset.