Data Mining for Cardiac Arrhythmia Detection using KNN, Naive Bayes, SVM, Gradient Boosting, Model Tree and Random Forest

   

Added on  2023-06-10

Abstract
The focus of this research is data mining. The aim of this project is to leverage the methods learnt in the course module and to execute a data mining study. The research distinguishes between the presence and absence of cardiac arrhythmia and classifies each record into one of 16 groups; the purpose is to decrease the disagreement between the cardiologists' classifications and those of an existing computer program. The literature review in this report provides the background needed to understand the various methods that can support data mining for this project, and the uses and effects of these methods in various studies are examined. Subsequently, the provided data set is analysed using KNN, Naïve Bayes, SVM, Gradient Boosting, Model Tree and Random Forest. The methodology used is the CRISP-DM procedure, a well-proven methodology known for its robustness. It is observed that the accuracy of heartbeat classification over the classes is 77.3%. KNN, Naïve Bayes, SVM, Gradient Boosting, Model Tree and Random Forest methods are all discussed in this project. Future work is aimed at finding the heartbeat accuracy using an online detection method instead of the one used here, with improvements in time and budget.
Keywords: KNN, SVM, CRISP-DM, Random Forest, Gradient Boosting, Model Tree, Naïve Bayes, heartbeat, cardiac arrhythmia
1. Introduction
This research distinguishes between the presence and absence of cardiac arrhythmia and classifies each record into one of 16 groups. The research questions concern the accuracy of heartbeat classification. The methodology utilised for this project is the CRISP-DM procedure, a well-proven methodology known for its robustness. Here, the provided data set is analysed using methods such as KNN, Naïve Bayes, SVM, Gradient Boosting, Model Tree and Random Forest. The database comprises 279 attributes, of which 206 are linear valued and the remaining are nominal. Class 01 denotes a 'normal' ECG, classes 02 to 15 denote different classes of arrhythmia, and Class 16 denotes the remaining unclassified records. Currently there exists a computer program that makes all these classifications, but there are some differences between the cardiologists' and the program's classifications. In this case, the cardiologists are taken as the gold standard, and machine learning tools are used to decrease such differences.
Aim
As a whole, this project mainly aims to leverage the methods learnt in the course module and to execute a significant data mining study. The provided data set is used as the source for this project.
Objective
The objective of this project is to distinguish between the presence and absence of cardiac arrhythmia and to classify it into one of the 16 groups, thus decreasing the differences between the cardiologists' and the program's classifications.
2. Literature Review
According to [1], several models are utilised for predicting a project's failure risk, and one of them is Naïve Bayes. The Naïve Bayes model was used to create a confusion matrix; when the results were compared, they showed a different distribution than the population used for scoring the set, which indicated that the validation had to be corrected. A key reason for selecting Naïve Bayes is its capacity to handle missing data, which is useful for projects such as CSI, where data go missing every now and then. The selection of Naïve Bayes thus has a twofold benefit: the missing information is captured, and it yields a simple probabilistic classifier based on the Bayes independence assumption. It is observed that the probability estimates of Naïve Bayes can often be inaccurate, and in certain cases poorly calibrated, but its classification performance in organisational applications is good. Even when the attributes are highly interdependent, the method can quickly process large samples and indirectly handle records with missing interdependent elements.
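As a hedged illustration of the Naïve Bayes classifier discussed above, the following sketch trains scikit-learn's GaussianNB on synthetic two-class data; the data, class means and library choice are assumptions for illustration only, not part of the cited study.

```python
# Minimal Gaussian Naive Bayes sketch on synthetic data (illustrative
# values only, not taken from the arrhythmia data set).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# two classes whose three features have different means
X0 = rng.normal(0.0, 1.0, size=(100, 3))
X1 = rng.normal(2.0, 1.0, size=(100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
# predict_proba exposes the (often poorly calibrated) class probabilities
print(clf.predict_proba(X_te[:2]))
```

Note that `predict_proba` illustrates the calibration caveat mentioned above: the predicted labels can be accurate even when the probability estimates themselves are extreme or miscalibrated.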
As per [2], the KNN algorithm is one of the simplest machine learning algorithms; it assumes that objects close to each other share the same characteristics, so the class of an instance is predicted from the characteristic features of its nearest neighbours. Generally, KNN deals with continuous attributes, but it can also work with discrete attributes: when discrete attributes are compared, if the attribute values of two instances differ, the difference between them is taken as one, otherwise as zero. The reported results show the sensitivity, specificity and accuracy of KNN for diagnosing patients with heart disease. With K ranging from 1 to 13, the achieved accuracy ranged from 94% to 97.4% across the different K values; K = 7 achieved the highest accuracy and specificity (97.4% and 99% respectively). The paper shows that KNN is a widely utilised data mining method for classification problems. KNN's simplicity is regarded as its strength, and its relatively high convergence speed makes it a popular choice. The major demerit of KNN classifiers is the large memory required to store the complete training sample; when the sample is large, the response time on a sequential computer is also large. KNN finds the closest neighbours of the given data among all the available training data. In this paper, if a label is found the algorithm stops; otherwise, the system classifier is applied. The proposed algorithm was used to recognise the object, and the results were compared with those obtained with the single system classifier and with KNN alone.
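The overlap distance for discrete attributes described above (per-attribute difference of 1 when the values differ, 0 otherwise) can be sketched as a small majority-vote KNN; the attribute values and labels below are invented placeholders, not data from the cited study.

```python
# KNN with the overlap distance for discrete attributes: per attribute,
# the difference is 1 if the two values differ and 0 otherwise.
import numpy as np
from collections import Counter

def overlap_distance(a, b):
    # number of attributes on which the two instances disagree
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=7):
    dists = [overlap_distance(query, row) for row in train_X]
    nearest = np.argsort(dists)[:k]          # indices of the k closest rows
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority vote among neighbours

# toy discrete-attribute training set (placeholder values)
train_X = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
train_y = ["sick", "sick", "healthy", "healthy"]
print(knn_predict(train_X, train_y, ("low", "yes"), k=3))  # -> healthy
```

The memory demerit noted above is visible here: every query scans the entire stored training set, so both storage and response time grow with the sample size.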
According to [3], the authors analyse the effectiveness of SVM in classifying a medical dataset (heart disease classification) with classification techniques, and further compare the performance of the Naïve Bayes classifier, the RBF network and the SVM classifier. With respect to SVM, the observations prove that it can generate an effective level of classification accuracy. The authors used the WEKA environment to retrieve the results; for the medical dataset especially, the SVM classifier proved robust and effective.
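A minimal sketch of an SVM classification experiment in the spirit of the one above, using scikit-learn rather than WEKA; the synthetic data, kernel choice and hyperparameters are assumptions for illustration.

```python
# Cross-validated SVM sketch; data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# scaling matters for SVMs: features on large scales dominate the kernel
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```

Cross-validation is used here instead of a single train/test split so the reported accuracy is less sensitive to how the data happen to be partitioned.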
The authors in [4] conclude that bagging works well for most decision tree types but needs some tuning, whereas neural nets and SVMs need careful selection of parameters. Random Forest, boosting, SVM and the other methods showed significant performance on the STATLOG data set.
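The bagging, Random Forest and boosting comparison described in [4] can be sketched as follows, assuming scikit-learn's implementations and a synthetic data set in place of STATLOG; the hyperparameters are illustrative defaults, not tuned values.

```python
# Compare three tree ensembles by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

This mirrors the point in [4]: the bagged trees run reasonably with defaults, while the relative ranking of the ensembles shifts with tuning and with the data set.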
It is stated in [5] that model trees are a type of decision tree with linear regression functions at the leaves, and are considered a successful method for predicting continuous numeric values.
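Since scikit-learn has no built-in model tree, the following sketch approximates one under that assumption: a shallow regression tree partitions the input space, then a separate linear regression is fitted in each leaf. The piecewise-linear toy data are illustrative, not from the paper.

```python
# Model-tree sketch: tree partitions + per-leaf linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.where(X[:, 0] < 0, 2 * X[:, 0] + 1, -X[:, 0] + 3)
y = y + rng.normal(0, 0.1, size=200)          # noisy piecewise-linear target

# Step 1: partition the input space with a depth-limited tree
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
leaf_of = tree.apply(X)

# Step 2: fit one linear model per leaf
leaf_models = {leaf: LinearRegression().fit(X[leaf_of == leaf],
                                            y[leaf_of == leaf])
               for leaf in np.unique(leaf_of)}

def predict(X_new):
    # route each row to its leaf, then use that leaf's linear model
    leaves = tree.apply(X_new)
    return np.array([leaf_models[l].predict(row.reshape(1, -1))[0]
                     for l, row in zip(leaves, X_new)])

print("train MSE:", np.mean((predict(X) - y) ** 2))
```

The per-leaf linear functions are what let a model tree fit a smooth trend with far fewer leaves than a constant-valued regression tree would need.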
3. Data Mining Methodology
The general methodology used for this project is the CRISP-DM (Cross-Industry Standard Process for Data Mining) procedure. This process provides a structured method for planning a data mining project; it is an effective, robust and well-proven methodology. Its objective is to set the goals, produce a project plan and lay out the business success criteria. The figure below represents the CRISP-DM model [6].
Figure: CRISP-DM model
The CRISP-DM procedure includes six main stages [6]:
1) Business understanding
This stage aims to understand the business's requirements and vision, translates them into a description of the data mining questions, and develops a plan to meet the organisation's aim.
2) Data understanding
In this stage the raw data are explored to understand the information hidden within them; hypotheses are formed and the quality of the data is assessed.
3) Data preparation
This stage covers selecting the data for analysis, cleaning the data, constructing the required data and integrating the data.
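The data-preparation stage can be sketched as below; the column names and the '?' missing-value marker are illustrative assumptions (the '?' convention is modelled on the UCI arrhythmia file, not something stated in this report).

```python
# Hedged data-preparation sketch: clean '?' markers, impute medians,
# drop uninformative constant columns. Column names are placeholders.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "heart_rate": ["72", "?", "88", "65"],
    "qrs_duration": ["91", "85", "?", "80"],
    "constant_col": ["0", "0", "0", "0"],
})

# 1) cleaning: turn '?' into NaN and coerce the columns to numeric
df = raw.replace("?", np.nan).astype(float)
# 2) constructing required data: median imputation for missing entries
df = df.fillna(df.median())
# 3) selection: drop attributes with no information (zero variance)
df = df.loc[:, df.nunique() > 1]
print(df)
```

Median imputation is only one reasonable choice here; the right strategy depends on the attribute semantics established during data understanding.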
4) Model designing
Once the existing subgroups are
understood, the model is ready for