IT8411 Business Intelligence - Big Data Analysis Report


Added on 2023/06/15

AI Summary

This report delves into the analysis of big data using various data mining techniques, including classification, clustering, association rule mining, and outlier detection, performed with the WEKA tool. It begins by providing a background on big data, emphasizing its characteristics of volume, velocity, and variety, and introduces biologically inspired data mining concepts like swarm intelligence. The analysis employs a dataset called Airlines.arff, and the report details the data preprocessing steps, followed by an exploration of classification algorithms such as Naive Bayes, Decision Stump, and Multilayer Perceptron, evaluating their accuracy and performance metrics. Clustering techniques, including k-means and Farthest First, are applied to group similar data points, and association rule mining, using algorithms like Apriori and Filtered Associator, identifies relationships between dataset attributes. Outlier detection methods are also implemented to find anomalies within the data. The report concludes with a summary of findings, highlighting the effectiveness of different techniques and suggesting future research directions.

Lakshmi Prabha
BIG DATA ANALYSIS

Table of Contents
Abstract
Background of Big data
Background of the Chosen Technique
Classification
Results and Findings
Classification
Clustering
Association Rule Mining
Outlier Detection
Summary and Conclusion
Future Work
References

Abstract
Big data is a broad term for data sets so large or complex that traditional data
processing applications are insufficient. The volume of data grows every year. Big data can
be analysed with data mining techniques such as classification, clustering, association rule
mining and outlier detection. In this project, the analysis of big data is performed using the
data mining tool WEKA. This report explains the background of big data and of biologically
inspired data mining. The analysis is carried out with several data mining algorithms, and
the results are explained through findings and output screenshots. A summary, conclusion
and future work are also provided.

Background of Big data
Big data is a term applied to the large datasets of the modern era. Such datasets now
account for a major share of the world's data ("Brief History of Big Data", 2017). Big data
is a broad term for data sets so large or complex that traditional data processing
applications are insufficient, and the amount of data grows every year ("Brief History of
Big Data", 2017). Individual datasets currently range in size from a few dozen terabytes (TB)
to many petabytes (PB). The three major characteristics of big data are data volume, data
velocity and data variety. Swarm intelligence is a biologically inspired data mining approach
that is also regarded as an optimization technique (Liu, Zhu, Chen & Yang, 2015). Its
techniques are based on the behaviour of swarms of bees and other insects, and of schools of
fish, while they search for food (Cambria, Mazzocco & Hussain, 2013). Data mining is one of
the areas in which these biologically inspired algorithms are applied. In this report, big data
analysis is performed using the WEKA tool (Kuiler, 2014). Various data mining techniques
are applied and the results are reported accordingly ("Research papers that changed the
world of Big Data", 2017). A large dataset called Airlines.arff is chosen for the analysis. The
results are explained using the findings of each algorithm and the corresponding output
screenshots.
Background of the Chosen Technique
Data mining comprises several types of techniques, including clustering, classification,
association rule mining and outlier detection ("What is big data analytics? - Definition
from WhatIs.com", 2017). In this report, the chosen technique is classification.
Classification is a supervised learning technique used in data mining to predict outcomes
from labelled data.
Classification
Classification is a data mining technique that assigns the records of a dataset to
predefined classes. It is a supervised learning technique: it predicts an output that depends
on the input. To make these predictions, a classification algorithm processes the training
dataset that is loaded into the data mining tool ("What Data Mining Classification Has to do
with Those Offers in your Inbox", 2017). The algorithm learns the relationship between the
attributes of the dataset and the class, and the accuracy of its predictions is used to judge
how good the algorithm is. The objective of a classification algorithm is to discover how the
attributes in the dataset lead to the class value (Raviya & Gajjar, 2012). There are many
classification algorithms. Some commonly used ones are Naive Bayes, SMO, J48, Decision
Stump and the Multilayer Perceptron. Classifiers can be compared in terms of their accuracy
and their per-class TP Rate, FP Rate, Precision, Recall, F-measure and ROC Area values. Of
these classifiers, the Multilayer Perceptron is reported to produce more accurate results
than the others (Raviya & Gajjar, 2012). Naive Bayes is also a commonly used classification
algorithm in data mining. The J48 classification algorithm produces its results in the form
of pruned decision trees ("What Data Mining Classification Has to do with Those Offers in
your Inbox", 2017). WEKA provides a number of classifiers, and each may produce different
predictions (Srivastava, 2014).
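The evaluation measures listed above can all be derived from the four counts of a two-class confusion matrix. The following is a minimal sketch of that arithmetic; the function name and the counts are hypothetical and are not taken from the report's WEKA output.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common evaluation metrics for one class from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0    # TP Rate is the same as Recall
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC Area is left out because it needs the ranked prediction scores rather than a single set of counts.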

Results and Findings
Data Preprocessing

Data preprocessing is the first step carried out in the WEKA data mining tool. In this
step, the dataset is loaded into the tool. The dataset chosen for analysis is Airlines.arff, a
large dataset that is suitable for big data analysis. Once the dataset is loaded, it is analysed
with various data mining techniques in WEKA, as explained in the following sections of this
report.
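To give a sense of what loading an ARFF file involves, here is a rough sketch of a reader for the basic ARFF layout (a header with @relation and @attribute lines, then comma-separated rows after @data). It is not WEKA's loader and it ignores quoted values, sparse rows and type conversion. The sample relation and attribute names are made up for illustration and are not the columns of Airlines.arff.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (relation, attribute names, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])   # keep only the name
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Toy file; the relation and attribute names are hypothetical.
SAMPLE = """\
% toy example
@relation flights
@attribute carrier {AA,BB}
@attribute length numeric
@attribute delayed {0,1}
@data
AA,120,1
BB,95,0
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, len(rows))
```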
Classification
The classifier is built in the data mining tool from the training dataset, which consists
of database tuples and their associated class labels. The tuples are also called samples, data
points or objects ("What Data Mining Classification Has to do with Those Offers in your
Inbox", 2017). The screenshots below show the output of the Naive Bayes classifier
(Srivastava, 2014). The Naive Bayes classifier was trained on the Airlines dataset, and the
built model was then used to classify the data. The output reports the root mean squared
error and the relative absolute error.
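The two error measures mentioned above can be sketched with their textbook definitions. This assumes numeric actual and predicted values; WEKA's own evaluation of a nominal class works on predicted class probabilities and may differ in detail. The sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def relative_absolute_error(actual, predicted):
    """Total absolute error relative to always predicting the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(a - mean_actual) for a in actual)
    return model_error / baseline_error


# Hypothetical values: 1 = delayed, 0 = not delayed, predictions as probabilities.
actual = [1, 0, 1, 1]
predicted = [0.9, 0.2, 0.6, 0.7]

rmse_value = rmse(actual, predicted)
rae_value = relative_absolute_error(actual, predicted)
print(f"RMSE: {rmse_value:.4f}")
print(f"Relative absolute error: {100 * rae_value:.2f}%")
```

A relative absolute error below 100% means the model does better than the baseline that always predicts the mean.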

Decision Stump

Decision Stump is another tree-based classification algorithm. It builds a decision tree
with a single split, so each classification is made on the basis of one attribute. Here it
classifies the instances into two classes, A and B.
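A one-level tree of this kind can be sketched as follows. The sketch assumes numeric attributes and picks the split with the fewest training errors; that criterion is an assumption made for simplicity and is not necessarily the one WEKA's Decision Stump uses. The data and function names are hypothetical.

```python
def fit_stump(rows, labels):
    """Return (feature_index, threshold, left_label, right_label) for the best single split."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue                      # a split must separate the data
            left_label = max(set(left), key=left.count)      # majority class
            right_label = max(set(right), key=right.count)
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:
        raise ValueError("no attribute can split the data")
    return best[1:]


def predict_stump(stump, row):
    """Classify one row with the single split stored in the stump."""
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


# Hypothetical training data with two numeric attributes and classes A and B.
rows = [[30, 1], [45, 2], [120, 1], [150, 3]]
labels = ["A", "A", "B", "B"]

stump = fit_stump(rows, labels)
print(stump)
print(predict_stump(stump, [100, 2]))
```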
Clustering
Clustering is the process of grouping objects into classes of similar objects (Madala,
2017). The first step in clustering is to partition the data into groups based on the
similarity of the data; the groups can then be assigned labels (Neha & Vidyavathi, 2015).
k-means Clustering

K-means is one of the most widely used clustering algorithms. The screenshots above
show the results of the k-means clustering algorithm (Huang & Su, 2014). The data in the
dataset is divided into two clusters, cluster 0 and cluster 1, and the number of instances in
each cluster is reported: 46% of the instances fall in cluster 0 and 54% in cluster 1. Because
the dataset is large, building the model takes about 9.74 seconds. K-means is considered an
effective clustering algorithm that can produce better and more efficient results than other
clustering algorithms.
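The way k-means arrives at two clusters and their percentage shares can be illustrated with a small pure-Python sketch. It assumes numeric points, squared Euclidean distance and a deterministic start from the first k points; WEKA's implementation initialises and handles attributes differently. The points are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Basic k-means: alternate between assigning points and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]   # assumption: start from the first k points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:        # no point changed cluster: converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)

# Percentage of instances per cluster, analogous to the shares reported above.
shares = [100 * assignment.count(c) / len(points) for c in range(2)]
print(assignment, shares)
```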

Farthest First Clustering Algorithm
The screenshots above show the clustering results of the Farthest First algorithm. Its
results differ from those of the k-means clustering algorithm (Vadeyar, 2014). K-means
places 46% of the instances in cluster 0, whereas Farthest First places 56% there; for
cluster 1, k-means has 54% and Farthest First has 44%. This is a difference of 10 percentage
points between the two algorithms. A difference of this size is acceptable because the two
algorithms work differently (Tiwari, 2012).
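One reason the two algorithms split the data differently is that Farthest First chooses its cluster centres in a single pass instead of iteratively averaging them. A minimal sketch of the farthest-first traversal follows, assuming numeric points and that the first point is taken as the first centre (an assumption made here so the result is deterministic). The points are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Pick k centres by farthest-first traversal, then assign points to the nearest centre."""
    centres = [points[0]]                       # assumption: first point is the first centre
    while len(centres) < k:
        # Next centre: the point whose nearest existing centre is farthest away.
        next_centre = max(points, key=lambda p: min(dist2(p, c) for c in centres))
        centres.append(next_centre)
    assignment = [
        min(range(k), key=lambda c: dist2(p, centres[c])) for p in points
    ]
    return centres, assignment


# Hypothetical two-dimensional points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, assignment = farthest_first(points, k=2)
print(centres, assignment)
```

Unlike k-means, the centres here are always actual data points and are never recomputed, so on less clearly separated data the cluster sizes can differ from those of k-means.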
Association Rule Mining
Association rule mining is the process of finding association rules between the
attributes of a dataset. Its objective is to find rules that satisfy given support and
confidence values. It involves two steps: frequent itemset generation and rule generation.
First, itemsets are generated whose support is >= the minimum support value. Then rules
with high confidence are generated from these frequent itemsets. Each association rule is a
binary partition of a frequent itemset. The algorithms available for association rule mining
in WEKA include the Apriori algorithm, the FP-Growth algorithm and the Filtered
Associator. The best rules found are displayed in the WEKA output.
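The two steps described above can be sketched on a toy set of transactions. This sketch enumerates every item combination by brute force, so it is not the Apriori algorithm itself and would not scale to a large dataset. The items, thresholds and function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Step 1: keep every itemset whose support is >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def generate_rules(frequent, transactions, min_confidence):
    """Step 2: split each frequent itemset into lhs -> rhs and keep confident rules."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = itemset_support / support(lhs, transactions)
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical transactions; each set lists the attribute values of one record.
transactions = [
    {"evening", "delayed"},
    {"evening", "delayed", "long"},
    {"morning", "ontime"},
    {"evening", "ontime"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(frequent, transactions, min_confidence=0.9)
for lhs, rhs, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), f"confidence={confidence:.2f}")
```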
Apriori Algorithm
The output screen above shows the 10 best rules. Apriori is one of the commonly used
algorithms in association rule mining. The association rules express the associations
between the attributes. The Apriori algorithm produces singletons, pairs and triplets of
associated attributes. Candidate generation is very slow with this algorithm: its runtime
grows exponentially with the number of items (Mottalib, Arefin, Islam, Rahman & Abeer,
2011). If a combination of items is frequent, then its subsets are also frequent sets. If