Data Mining Case Study 2022

Added on  2022/09/28

Running Head: DATA 1
Data Mining
Name of Student
Date
Classification Performance Evaluation
This part compares the performance of different classification models on three
datasets. All five models are run on each dataset, and we then compare how the
models actually perform on every dataset.
A. Iris Dataset
We start with the statistical summary of the dataset: the graphs and the
descriptive statistics for each variable.
figure 1
Figure 1 shows that the dataset contains 150 instances and a total of 5
variables.
figure 2
Figure 2 above shows that the variable is named sepal length and has 35
distinct values in total. The variable type is numeric and no cell has a missing
entry. The statistics recorded are the mean, maximum, minimum and standard
deviation (Eldén, 2019). The following figures give the descriptive statistics
for each of the remaining variables.
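The per-variable summary described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used to produce the figures: the function name describe_numeric and the sample readings are hypothetical, and a missing entry is assumed to be represented by None.

```python
import statistics

def describe_numeric(values):
    """Summarise one numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "minimum": min(present),
        "maximum": max(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }

# Hypothetical sepal-length readings, NOT taken from the figures.
sample = [5.1, 4.9, 4.7, 5.0, 5.1, None]
summary = describe_numeric(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running the same function on every numeric column would give one summary per variable, which is the layout the figures follow.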
figure 3
figure 4
figure 5
figure 6
figure 7
Figure 7 shows the graphical distribution of every variable, both numeric and nominal.
figure 8
Figure 8 gives the scatter plot of the variables. A total of three variables
are shown, and the clusters that form are defined in terms of those variables.
a. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
vi. Models comparison
The screenshots provided include summary statistics giving the correctly
classified instances. The more instances a model classifies correctly, the higher
its percentage, and the higher the percentage, the better the classification
model. On the Iris dataset, in the order the models were tested (Multilayer
Perceptron, Naïve Bayes, J48, Random Forest, REP tree), the percentages are 97%,
96%, 96%, 95% and 94%. The best model on this measure is therefore the Multilayer
Perceptron, and REP tree is the poorest performer of the five. The confusion
matrices show the same thing, as the better models have fewer incorrectly
classified instances (Jabez, Gowri, Vigneshwari, Mayan & Srinivasulu, 2019).
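The link between a confusion matrix and the percentage of correctly classified instances can be made concrete with a small sketch. The matrix below is hypothetical and is not copied from any of the screenshots; it only assumes the usual layout in which rows are actual classes and columns are predicted classes.

```python
def accuracy_from_confusion(matrix):
    """Percentage of correctly classified instances.

    matrix[i][j] counts instances of actual class i predicted as class j,
    so the diagonal holds the correct predictions.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

# Hypothetical 3-class matrix for 150 instances; NOT taken from the screenshots.
example = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 2, 48],
]
accuracy = accuracy_from_confusion(example)
print(f"{accuracy:.2f}%")  # prints 96.67%
```

Every off-diagonal count is a misclassified instance, so a model with fewer off-diagonal entries has a higher percentage.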
B. Breast Cancer
Here we look at the descriptive statistics in the same way as for the Iris
dataset above. Each figure gives the statistics for one variable, under the
variable name. There is a total of 10 variables.
figure 1
figure 2
figure 3
figure 4
figure 5
figure 6
figure 7
figure 8
figure 9
figure 10
figure 11
b. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
vi. Algorithms comparison
In the order the models were tested on the breast cancer dataset, the
percentages of correctly classified instances are 64.6853%, 71.6783%, 75.5245%,
69.5804% and 70.6294%. These results make the J48 decision tree the best performer
and the Multilayer Perceptron the poorest performing model (Kumar, Koushik & Deepak, 2018).
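The ranking can be reproduced directly from the reported percentages. The sketch below assumes the five figures map to the models in the order the sections list them (Multilayer Perceptron, Naïve Bayes, J48, Random Forest, REP tree); the variable names are illustrative.

```python
# Accuracies reported for the breast cancer dataset, in the order tested.
# Assumption: the order matches the order of the model sections.
reported = {
    "Multilayer Perceptron": 64.6853,
    "Naive Bayes": 71.6783,
    "J48": 75.5245,
    "Random Forest": 69.5804,
    "REP tree": 70.6294,
}

# Sort from highest to lowest percentage of correctly classified instances.
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)
best_model = ranking[0][0]
worst_model = ranking[-1][0]

for name, pct in ranking:
    print(f"{name:<22} {pct:.4f}%")
print("best:", best_model, "| worst:", worst_model)
```

Sorted this way, Naïve Bayes comes second, ahead of both REP tree and Random Forest, so the advantage belongs to J48 specifically rather than to tree-based models as a group.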
C. Diabetes
We start with the descriptive statistics for each variable, shown in the figures below.
figure 1
figure 2
figure 3
figure 4
figure 5
figure 6
figure 7
figure 8
figure 9
figure 10
Figure 11 below gives the scatter plot, and figure 12 gives the histogram distribution.
figure 11
figure 12
c. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
vi. Algorithms comparison
The comparison shows that the best performing model is Naïve Bayes and the
least performing model is J48. In the order the models were tested, the
percentages of correctly classified instances are 75.3906%, 76.3021%, 73.8281%,
75.7813% and 75.2604% (Lang, Bravo-Marquez, Beckham, Hall & Frank, 2019).
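Because the percentages are close together, it can help to convert them back into counts of correctly classified instances. The sketch below does this under a clearly labelled assumption: the text does not state the dataset size, so the figure of 768 instances (the size of the Pima Indians diabetes data distributed with Weka) is an assumption, as is the mapping of percentages to models by the order of the sections.

```python
# ASSUMPTION: 768 instances, the size of the diabetes data shipped with Weka.
# The report itself does not state the number of instances.
TOTAL_INSTANCES = 768

# Percentages reported above, assumed to follow the order of the model sections.
reported = {
    "Multilayer Perceptron": 75.3906,
    "Naive Bayes": 76.3021,
    "J48": 73.8281,
    "Random Forest": 75.7813,
    "REP tree": 75.2604,
}

def correct_count(pct, total):
    """Convert a percentage back to a whole number of correct instances."""
    return round(pct / 100 * total)

counts = {name: correct_count(pct, TOTAL_INSTANCES) for name, pct in reported.items()}
gap = max(counts.values()) - min(counts.values())

for name, count in counts.items():
    print(f"{name:<22} {count} of {TOTAL_INSTANCES}")
print("gap between best and worst:", gap, "instances")
```

Under that assumption, the gap between the best and the worst model is 19 instances, which shows how close the five models are on this dataset.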
Data Mining Applications (Task 2)
Data mining is the extraction of data and the use of the extracted data to aid
decision making and the running of a company. More formally, data mining is the
process of finding anomalies, patterns and correlations in large datasets. This
is done to support the classification and clustering of datasets, and in the long
run the classification and clustering serve a range of different purposes.
a. Purposes of Data Mining
Data mining is used by most organizations across different industries because
it supports organizational advancement. Data mining improves the relationship
between customers and an organization, as well as among the customers
themselves. It also helps to improve production volumes, which raises revenue
when all the output is sold, and this in turn improves profits and hence
advances the company (Malik, Abdallah, & Ala’raj, 2018).
b. Data Mining Stages
There are four stages of data mining: the data source stage, the data exploration
and preprocessing stage, the algorithm build-up stage and, finally, the deployment
stage, which brings about the improvement in production. Looking at each stage in
turn, the data source stage is where data is extracted from all the available
sources, specifically the databases from which data can be extracted. The second
stage is the data preprocessing stage, where the data is made better for
consumption, meaning that it is made suitable for exploration in the analysis
stage. The third stage is where we build the algorithms that are used for machine
learning and that aid the production processes in the company. Once the built
model is good enough for production, it is deployed (Souri & Hosseini, 2018).
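The four stages can be pictured as a small pipeline. The sketch below is purely illustrative: the function names, the record fields and the majority-class baseline used as the "algorithm" are all hypothetical choices made to keep the example short, not a description of any particular company's system.

```python
from collections import Counter

def extract(source):
    """Stage 1, data source: pull records from a source (here, an in-memory list)."""
    return list(source)

def preprocess(records):
    """Stage 2, preprocessing: drop records that have a missing field."""
    return [r for r in records if None not in r.values()]

def build_model(records, target):
    """Stage 3, algorithm build-up: fit a majority-class baseline on the target."""
    majority = Counter(r[target] for r in records).most_common(1)[0][0]
    return lambda _record: majority

def deploy(model, new_records):
    """Stage 4, deployment: apply the built model to unseen records."""
    return [model(r) for r in new_records]

# Hypothetical records; the field names are illustrative only.
raw = [
    {"visits": 3, "churned": "no"},
    {"visits": None, "churned": "yes"},
    {"visits": 1, "churned": "yes"},
    {"visits": 5, "churned": "no"},
    {"visits": 4, "churned": "no"},
]
clean = preprocess(extract(raw))
model = build_model(clean, target="churned")
predictions = deploy(model, [{"visits": 2}, {"visits": 7}])
print(len(clean), predictions)
```

A real build-up stage would replace the baseline with a classifier such as those compared earlier; the point here is only the order in which the stages hand data to one another.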
The techniques employed in data mining include classification, sequence analysis,
clustering and forecasting. Classification and clustering aid in identifying the
patterns present in the datasets being analyzed. Forecasting, on the other hand,
follows once the classification and clustering groups are built: it is the
prediction of what is expected to happen in the future (Wang, Ji, Liu, Wang,
Weng, Deng & Yuan, 2018).
c. Data Mining Application
Data mining is used mainly in business processes, whether in the invention of a
new product or in retaining customers who want to leave a line of goods or
services. Another use is increasing the amount of a good being produced.
The main focus in this case is the invention of a new product. Automobile
companies examined the extent to which the combustion of fossil fuels by car
engines degraded the environment over time, and found that burning fossil fuel
did cause real degradation. The environment has been affected over time through
global warming and through polluted air, which reduces the life span of the
individuals who inhale it. Based on the data that had been mined and analyzed,
this prompted most innovators to look for more environmentally friendly
approaches to car mobility. The development of electric cars followed. This
brought about a change and ensured a real reduction in the pollution caused by
fossil fuels, as the cars were powered by green energy. As more of these cars are
purchased, the pollution of the environment is expected to fall, since electric
cars do not burn fuel that pollutes the environment (Yee, Sagadevan & Malim,
2018).
References
Eldén, L. (2019). Matrix methods in data mining and pattern recognition (Vol. 15). SIAM.
Jabez, J., Gowri, S., Vigneshwari, S., Mayan, J. A., & Srinivasulu, S. (2019). Anomaly Detection
by Using CFS Subset and Neural Network with WEKA Tools. In Information and
Communication Technology for Intelligent Systems (pp. 675-682). Springer, Singapore.
Kumar, M. N., Koushik, K. V. S., & Deepak, K. (2018). Prediction of Heart Diseases Using Data
Mining and Machine Learning Algorithms and Tools.
Lang, S., Bravo-Marquez, F., Beckham, C., Hall, M., & Frank, E. (2019). Wekadeeplearning4j:
A deep learning package for weka based on deeplearning4j. Knowledge-Based Systems,
178, 48-50.
Malik, M. M., Abdallah, S., & Ala’raj, M. (2018). Data mining and predictive analytics
applications for the delivery of healthcare services: a systematic literature review. Annals
of Operations Research, 270(1-2), 287-312.
Souri, A., & Hosseini, R. (2018). A state-of-the-art survey of malware detection approaches
using data mining techniques. Human-centric Computing and Information Sciences, 8(1),
3.
Wang, R., Ji, W., Liu, M., Wang, X., Weng, J., Deng, S., ... & Yuan, C. A. (2018). Review on
mining data from multiple data sources. Pattern Recognition Letters, 109, 120-128.
Yee, O. S., Sagadevan, S., & Malim, N. H. A. H. (2018). Credit card fraud detection using
machine learning as a data mining technique. Journal of Telecommunication, Electronic
and Computer Engineering (JTEC), 10(1-4), 23-27.