Data Mining Tools and Techniques

Verified

Added on 2021/06/14

AI Summary

Contribute Materials

Your contribution can guide someone’s learning journey. Share your documents today.

Data Mining
(Tools and Mining Techniques)

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Index
1.Introduction…………………………………………………………………………..…..1
2.Data set Description….……………………………………………………………..…..1
3.Preprocessing…….…………………………………………………………………..….2
4.Important Attribute...….………………………………………………………………….4
5.Importance of Attributes…...………………………………………………………..…..5
6.Correlation and Regression Analysis..………………………………………………...6
7.Association Algorithm….………………………………………………………………...7
8.Classification and Clustering Algorithm………………………………………………..8
9.Result Analysis...…………………………………………………………………………9
10.Proposed Framework.…………………………………………………………………11
11.Conclusion………………………………………………………………………………14
12.References…………………………………………………………………………...…15

Introduction
Data Mining is the process or techniques to identify informative data from the data
warehouse. Big Data is the coming from different source and collected in different systems.
The main role is to predict the outcomes from the dataset, it also the motive for everyone
for their business and market standard.
There is huge amount of data available in the Data Science Industry. But this data is of no
use until unless we can not find out the informative information from the data. It is
necessary to analyze the bulk amount of data and extract useful infromation from it.
Data Mining is defined the term as extracting information from bulk amount of data. It is the
process of mining the knowledge from the huge data source.
Chronic Kidney Disease dataset is collected through UCI machine learning repository. The
main aim to identify the informative variables or fields from the dataset (S.J. and J.H., 2015).
1. Dataset Description
Chronic Kidney Disease dataset have 25 column. Below is the image of CKD data set with
variables name. In the image two types of data set we have : one is numeric and second is
categorical data. In numeric data we will consider id, age, bp, sg,al,su, bgr, bu, sc, sod, pot,
pcv, wc and rc. Whereas in categorical data we will have rbc, pc, pcc,ba, htn, dm, cad, appet,
pe and ane.
id age bp sg al su rbc pc pcc ba bgr bu sc sod pot hemo pcv wc rc htn dm cad appet pe ane classification
0 48 80 1.02 1 0 normal notpresent notpresent 121 36 1.2 15.4 44 7800 5.2 yes yes no good no no ckd
1 7 50 1.02 4 0 normal notpresent notpresent 18 0.8 11.3 38 6000 no no no good no no ckd
2 62 80 1.01 2 3 normal normal notpresent notpresent 423 53 1.8 9.6 31 7500 no yes no poor no yes ckd
3 48 70 1.005 4 0 normal abnormal present notpresent 117 56 3.8 111 2.5 11.2 32 6700 3.9 yes no no poor yes yes ckd
4 51 80 1.01 2 0 normal normal notpresent notpresent 106 26 1.4 11.6 35 7300 4.6 no no no good no no ckd
5 60 90 1.015 3 0 notpresent notpresent 74 25 1.1 142 3.2 12.2 39 7800 4.4 yes yes no good yes no ckd
6 68 70 1.01 0 0 normal notpresent notpresent 100 54 24 104 4 12.4 36 no no no good no no ckd
7 24 1.015 2 4 normal abnormal notpresent notpresent 410 31 1.1 12.4 44 6900 5 no yes no good yes no ckd
8 52 100 1.015 3 0 normal abnormal present notpresent 138 60 1.9 10.8 33 9600 4 yes yes no good no yes ckd
9 53 90 1.02 2 0 abnormal abnormal present notpresent 70 107 7.2 114 3.7 9.5 29 12100 3.7 yes yes no poor no yes ckd
10 50 60 1.01 2 4 abnormal present notpresent 490 55 4 9.4 28 yes yes no good no yes ckd
11 63 70 1.01 3 0 abnormal abnormal present notpresent 380 60 2.7 131 4.2 10.8 32 4500 3.8 yes yes no poor yes no ckd
12 68 70 1.015 3 1 normal present notpresent 208 72 2.1 138 5.8 9.7 28 12200 3.4 yes yes yes poor yes no ckd
13 68 70 notpresent notpresent 98 86 4.6 135 3.4 9.8 yes yes yes poor yes no ckd
14 68 80 1.01 3 2 normal abnormal present present 157 90 4.1 130 6.4 5.6 16 11000 2.6 yes yes yes poor yes no ckd
15 40 80 1.015 3 0 normal notpresent notpresent 76 162 9.6 141 4.9 7.6 24 3800 2.8 yes no no good no yes ckd
16 47 70 1.015 2 0 normal notpresent notpresent 99 46 2.2 138 4.1 12.6 no no no good no no ckd
17 47 80 notpresent notpresent 114 87 5.2 139 3.7 12.1 yes no no poor no no ckd
18 60 100 1.025 0 3 normal notpresent notpresent 263 27 1.3 135 4.3 12.7 37 11400 4.3 yes yes yes good no no ckd

Dataset Information
As per the dataset we have 24 health related attributes taken in of 400 patients. Out of 400
patients there is 158 patients having complete records but remaining 242 patients having
missing values in the dataset.
age -age
bp – blood pressure
sg – specific gravity
al – albumin
su – sugar
rbc – red blood cells
pc – pus cell
pcc – pus cell clumps
ba – bacteria
bgr – blood clucose random
bu – blood urea
sc – serum creatinine
sod – sodium
pot – potesium
hemo – hemoglobin
pcv – packed cell volume
wc – white blood cell count
htn – hypertension
dm – diabetes mellitus
cad – coronary artery disease
appet – apptite
pe – pedal edema
ane – anemia

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Attribute Information
As per the report we have 24 variables and one class so total 11 numeric and 14 nominal.
Age (numeric) – age in years
Blood pressure (numerical) – bp in mm/Hg
Specific Gravity(nominal) - sg – (1.005,1.010,1.015,1.020,1.025)
Albumin(nominal) - al – (0,1,2,3,4,5)
Sugar(nominal) - su – (0,1,2,3,4,5)
Red Blood Cells(nominal) - rbc – (normal,abnormal)
Pus Cell (nominal) - pc – (normal,abnormal)
Pus Cell clumps(nominal) - pcc – (present,notpresent)
Bacteria(nominal) - ba – (present,notpresent)
Blood Glucose Random(numerical) - bgr in mgs/dl
Blood Urea(numerical) -bu in mgs/dl
Serum Creatinine(numerical) - sc in mgs/dl
Sodium(numerical) - sod in mEq/L
Potassium(numerical) - pot in mEq/L etc.
2. Preprocessing
In data set we have some missing values which can not be classify the model, So we need to
remove missing null data and replace with NaNs in variables bp, sg, al, su, rbc, pc, sod, pot,
pcv and wc (Sanjay & Rani, 2016). These variables have missing null field in the data set so
we replaced null field with NaNs values.

3. Important attribute
From the dataset we found three important variable
bgr( blood clucose random),
rc(red cells) and
wc(white blood cell count)
Chronic Kidney Disease ia s condition in which the kidneys are demaged or con not filter the
blood as well as healthy kidneys. So due to this cause the waste from the blood remain in
the body and it will cause for some health problem (Cox and C., 2010).
From the analysis on the dataset we can say that the 14% people have blood problem.
41% of the data with severely removes kidney, but dialysis are not aware of having Chronic
Kidney Disease.
4. Importance of Attributes
Attributes which we have selected in the report is closely related to each other.
n Attributes
1 Specific Gravity (sg)
2 Albumin (al)
3 Serum creatinine (sc)
4 Hemoglobin (hemo)
5. Correlation and Regression Analysis
id age bp sg al su pc pcc ba ... pcv wc rc htn dm cad appet pe ane classification
0 0 48 80 1.02 1 0 NaN normal notpresent notpresent ... 44 7800 5.2 yes yes no good no ckd
1 1 7 50 1.02 4 0 NaN normal notpresent notpresent ... 38 6000 NaN no no no good no ckd
2 2 62 80 1.01 2 3 normal normal notpresent notpresent ... 31 7500 NaN no yes no poor no ckd
3 3 48 70 1.005 4 0 normal abnormal present notpresent ... 32 6700 3.9 yes no no poor yes ckd
4 4 51 80 1.01 2 0 normal normal notpresent notpresent ... 35 7300 4.6 no no no good no ckd
5 5 60 90 1.015 3 0 NaN NaN notpresent notpresent ... 39 7800 4.4 yes yes no good yes ckd
6 6 68 70 1.01 0 0 NaN normal notpresent notpresent ... 36 NaN NaN no no no good no ckd
7 7 24 NaN 1.015 2 4 normal abnormal notpresent notpresent ... 44 6900 5 no yes no good yes ckd
8 8 52 100 1.015 3 0 normal abnormal present notpresent ... 33 9600 4 yes yes no good no ckd
9 9 53 90 1.02 2 0 abnormal abnormal present notpresent ... 29 12100 3.7 yes yes no poor no ckd
10 10 50 60 1.01 2 4 NaN abnormal present notpresent ... 28 NaN NaN yes yes no good no ckd
rb
c

Corrletion is the statistical techniques that can show weather and how stongly paires of
varibles are related. In the study the variable rbc (red blood cells) and pc (pus cell) is
correlated to each others.
Below graph shows the relationship between all the variables present in the dataset. It
shows the correlation between different predictors values as per their relationship.
From the graph it is clear that the variable which are weekly correlated to each other their
chance is less for disease but stongly correlated variable impact on high disease.
Less correlated > Less chance for disease
Highly correlated < High chance for disease

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

6. Association Algorithm
From the data set we use EM association algorithms which is below :
Expectation Algorithm(EM)
EM algorithm is the clustering techniques which comes under the model which depends
upon the probability distribution. The EM algorithm is types of unsupervised learning with a
set of inputs we have to pass as an traning set. It finds the maximum probabilty which
estimates the parameters in statistical analysis (T.D., & V.C., et al.,2013).
Algorithm : Expectation-Maximization
Input : Our dataset is Chronic Kidney Disease
Outout : Cluster form
Step1: Expectation (E): which creates a function for the expectation of the log-likelihood
evaluated using the current estimate for the parameters, and
X = E [log L (θ|Z)]……………………………. 1

Where, Θ = initial guess for parameter, and Z = missing value .
Step2: Maximization (M): This computes parameters maximizing the anticipated log-
likelihood observed at the E step.
θᶦargmax X………………………………… 2
Use the computed value to obtain better estimates for Θ.
Step3: Iterate E step and M step until converge
7. Classification and Clustering Algorithm
Data set having multiple categorical variables so we use ANN classification algorithms.
The algorithm of the data in CKD used by Principal Component Analysis (PCA).
Principal Component Analysis is a dimentionality reduction techniques in which a
convariance analysis between factors takes place (Medvedev, 2014). The original mining
data is mapped into coordinate system based on the variance within the data.
PCA applies a mathematical procedure for tranforming a number of correlated variables into
a smaller number of uncorrelated variables called principal components.
Algorithm : This algorithm is step out in 5 points :

So using this algorithm we selected threee variables or features (bgr, rc, wc) and visualize
them and running the Principal Componant Analysis with n_components = 2. The PCA will
run twice : one with no scaling and second run for WITH scaling.
Detailed classification report:
precision recall f1-score support
0.0 1.00 1.00 1.00 39
1.0 1.00 1.00 1.00 14
avg / total 1.00 1.00 1.00 53
Confusion Matrix:
[[39 0]
[ 0 14]]
Best
parameters:

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

{'class_weight': None, 'max_depth': 2, 'n_estimators': 8,
'random_state': 42}
Artificial Neaural Network (ANN)
“a computing system made up of a number os simple, highly interconnected procesing
elements, which process information by their dynamic state response to external inputs”
(C., 2010), . It is the process of working of human brain to take right decisions.
Algorithm : Artificial Neaural Network(ANN)
Input : The output from EM algorith to ANN
Output : calculate accuracy, precision and recall
8. Result Analysis
In study report we have used Data Mining clustering and classification algorithms which is
Expectation Maximization and Artificial Neaural Network. Using these techniques doctors to
diagnose and can suggest the treatment to the patients. It can also help to patients to get
the real information about the medicine and treatment, also know about the health
conditions.

We have calcuated the accuracy for each data mining algorithm. Now we seen that EM
algorithm is a type of clustering algorithm. From this algorithm we got the accuracy 70%.
Artificial Neaural Network is a classification type of algorithm and the accuracy is about 75%.
So from the analysis of both the algorithm we concludes the results that the ANN accuracy is
better then EM.
So the outcomes from the model is that we can say the prediction from ANN for Chronic
Kidney Disease is more accurate.
Predictive Analysis Models for CKD
In the research study to predict Chronic Kidney Disease, we have to use different analytics
and prediction algorithms. Artificial Neaural Network (NN) for calssification algorithm,
Expectation Maximization (EM) for clustering algorithm. Along with we have to choose
analysis and accuracy techniques : accuracy, precision and recall.
The confusion matrix is also used to get true positive patients who are effected from the
disease in CKD.
Proposed
Framework

The above image is shown the framework of data mining. In the first phase the raw data are
analysed and processed to identify the missing values and outlier from the data set. To deal
with the missing data we have to used two criteria : one if the varialve data is greater than
15% missing data (variable > 15% missing data) removed. Second if the variable has 15% less
missing data. After removing missing values we have to collected similar data set into similar
groups, and this techniques is called clustering. It means the same group of data set came
into same cluster. Based on the cluster we have to pass association rule mining algorith to
predict the future outcomes.
Data Collection
Data Collection is the process to collect data from different different sources and store the
data in central warehouse for futher analysis. Data is collected through audio, video, media
devices, text files, images and black boxes (Ameta & Jain, 2017).
To identify the knowledgeable information from the warehouse data is the main motive for
every industry and organization.
We have Chronic Kidney Disease dataset which have total 400 records with 24 variables
name. The data was collected to predict the future results and to save the cost and provides
better treatments to the patients. The data set includes different disease set which happens
with the people.
Data was collected from Chronic Kidney Disease organization to help petients future
treatments and cost of better results. From the future analysis patients can get better
treatment and medicie and can aware themselves from disease.
Preprocessing
In the preprocessing part we have total 400 records of the patients and out of 400 only 158
patients having complete records and remaing patients have some missing values in their
records. We have to identify the missing variables in the dataset which does not have
sufficient information in the records.

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

The data cleaning process we have NaNs and numeric features need to be forced to floats.
So we remove all the rows which has values NaNs or deleted those rows which having NaNs
values in the columns.
Here we replace the null values to NaNs in all empty columns
In preprocessing part we have three attributes bgr( blood clucose random),rc(red cells),
wc(white blood cell count) which has most informative fields in the dataset. If we replaced
the empty value to NaNs, so we did not get any changes in the dataset outcomes.
Load the dataset using numpy, pandas and matplotlib library functions in python with using
sklearn modules. It will load the data using below code.
Python function loads the python library module and process the missing dataset. We will
find out the preprocessed data which do not have missing values. This model will replace
the empty values to NaNs and gives the final preprocessed results.
Analysis Factor for Developing CKD
Peoples with diabetes,bood presure or both is the cause of higher risk for developing CKD
than those without disease. From the dataset one more factor is that for disease is heart
disease, obesity and household past history of CKD (Devi, 2014).
Methodology
The clinical data consist of patients records which are had been considered for the analysis
and that dataset is taken from UCI Machine Learning Repository. There are total 25
attributes in dataset. Numerical and Nominal values of attributes are considered. The
variables for Chronic Kidney Disease is shown above.
We are using Data mining and clustering techniques to predict the Chronic Kidney Disease
and its sevarity.
Clustering techniques or algorithm which is using here is Expectation Maximazation (EM),
where the dataset fed as input and the output is fed into Artificial Neaural Network(ANN),
(Devi, 2014).
Objective
The objective is make a cluster similar kind of affected patients records and do further
classify the affected patients records from the variations.
Data Mining Techniques

Clustering is the process to collect similar type of data point or information into one. Or in
simple term can say that grouping the information. It is finding the similar charactristics into
same group.
Classification is for data mining techniques to predict the final results. It is classify the data
with respect to patients disease and make a mining cluster for similar group.
Examine Correlations
CKD stages
Stages pcc (cc/min)
0 >90
1 ≥90
2 60-89
3 30-59
4 15-29
5 <15
Chronic Kidney Disease is the most common disease in the world. It has broad range of
pathophysiological procedures which will be outcomes along with abnormal function of
kidneys.
So from the table it is clear that 6% of adults are in CKD stages 1 and 2. Ans 4.5% of people
are in stage 3 and 4.
Model
Evaluation
Model
evaluation is the
process to get

the data mining results based on the historical data. Also compaired the data with past to
get actual result.
The procedure for an business model is about to understanding the whole flow of data
model. Data Mining is a techniques to extract the information from bulk dataset or an
datawarehouse. The Mining model runs as per the below listed data points.
Business Understanding : Business understaind is about the main thought of the process
what we need to run in the market or which should be our aim to get from the consumers.
Before going to anywhere we need to understand about the business model, what we are
going to do in real market. It is the first phase of business life cycle.
Data Understanding : Data understanding is about the exploring and analysis the data
which have. Which variables can be done for outcomes or not. In this phase we need to
understand about the data, because understanding about the data make sense to predict
the future outcomes.
Data Preparation : Data preparation is to explore and validate the data for business. It is the
flow we prepared accurate data which should be free from missing values and the step is
known as data preprocessing. Data preparation is the main phase for data model to take the
decision on the data future outcomes. A wrong decision will impact the whole flow of the
business (S., and K., 2014).

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Variables the have more then 15% missing data are removed. We can also use the methods
to remove unwanted data from the dataset. Five method we have to use : fixed the dataset,
fixed data with range, random (using normal distributaion), random data (using uniform
distribution) and C and RT algorithms.
Modelling : Modelling is the techniques of ducmenting a complex data system that design
in such a way to understand images, diagram, text and symbols to represent the path to
flow the data. The diagram can be blueprint about the data flow (Kumar & Singh, 2017).
Evaluation: Model Evaluation is an integral part of the model development. It provides the
support to find the best model that represent our data. If the model does not provide the
actual results means the we need to do some evaluation again and correct the model once.
Deployment : Model Deployment is the process to validate the new coming data from
source. It will predict the outcomes for the new data. It is model to deploy in real time basis
to get actual outcomes. When the testing set is run on the model then the result come in
the form of probability and we can take dicision based on the outcomes.
Deployment is an real time application which can provides the prediction outcomes based
on the selected informative variables
Conclusion
This study is based on big data technimilogy that is called data mining in the field of
healthcare. Dara analysis and prediction of Chronic Kidney Disease used data mining
techniques. Expectation Maximization (EM) is the clustering algorithm which is used to
make similer types of clusters of same person’s group. Artificial Neaural Network is the
classfication algorithm which is used to classify the prediction values based on the disease.
The result is calcualted in form of data minnig accuracy. Here Expectation Maximization is
the clustering algorithm and accuracy is about 70% whereas Artificial Neaural Network is the
classification algorithm which is used to classify the disease of the patients and accuracy of
ANN is about 75%. ANN is more accurate to predict the disease of Chronic Kidney Disease
dataset. It is clear that the classification algorithm is used to predict the outcomes for the
disease, it means the patients affected from the disease or not in CKD dataset.

The clustering techniques is used to make similer type of groups to make the affected group
of peoples.
The aim of anaysis for Chronic Kidney Disease is to decrease the cost and provides better
treatment to the patients.
In future the use of clustering and classification is to provides better seamless option to the
patients. We can use the dataset to analyse and calculate the accuracy for future prediction.
References
Lee, S.J., and Jeon, J.H., (2015), Relationship between Symtom Clusters and Quality of Life in
Patients at Stages 2 to 4 Chronic Kidney Disease in Koria. Applied Nursing Research, p: 13-
19.
Bala, S., and Kumar, K., (2014), “A Litrature Review on Kidney Disease Prediction Using Data
Mining Classification Techniques,” International Journal of Computer Science & Mobile
Computing, p: 960-967.
Tangari, N., Kitsios, G.D.,et al.,(2013), “Risk Prediction Models for patients with Chronic
Kidney Disease” Annals of Internet Medicine, p: 596-603.

Hippisley-Cox,J., and Cupland, C.,(2010), “Predicting the Risk of Chronic Kidney Disease in
men and women in England and Wales, p: 11-49.
T.Sanjay , C.M Sheela Rani and K.V Narayana,(2016). Indian Journal of Science and
Technology : “A Survey on Clustering Techniques for Big Data Mining”, p: 465-477.
Harshit Kumar, Nishant Singh,(2017). International Research Journal of Engineering and
Technology : “Review paper on Big Data in healthcare informatics”, p: 341-351.
Viktor Medvedev(2014). “Strategies for Big Data Clustering” IEEE 26th International
Conference on Tools with Artificial Intelligence, p: 4438-4445.
Dr. S. Vijayaran, Mr.S.Dhayanand.(2015). “Kidney Disease prediction using SVM and ANN
algorithms”, International Journal of Computing and Business Research, p: 412-416.
J Chitra Devi(2014). “Binary Decision Tree Classification based onC4.5 and KNN Algorithm for
Banking Application”, International Journal of Computational Intelligence and Informatics, p:
13-25.
Ms. AsthaAmeta,Ms. Kalpana Jain(2017). “Data Mining Techniques for the Prediction of
Kidney Diseases and Treatment: A Review”, International Journal Of Engineering And
Computer Science, p: 20-25.
Noia, T.D., Ostuni, V.C., et al.,(2013), “An End Stage Kidney Disease Predictor Based on
Artificial Neural Networks Ensemble,” Expert Systems with Applications, p: 189-191.

1 out of 19

Your All-in-One AI-Powered Toolkit for Academic Success.

+13062052269

info@desklib.com

Available 24*7 on WhatsApp / Email

Company

Tools

Support