Data Mining Case Study 2022
Added on 2022/09/28
Running Head: DATA 1
Data Mining
Name of Student
Date
Classification Performance Evaluation
This part compares the performance of five classification models on three datasets. All five
models are run on each dataset, and their performance is then compared dataset by dataset.
A. Iris Dataset
We begin with the structure of the dataset, the graphs, and the descriptive statistics for each
variable.
figure 1
Figure 1 shows that the dataset contains 150 instances and a total of 5 variables.
figure 2
Figure 2 shows the variable sepal length, which has 35 distinct values. The variable type is
numeric and no cell has a missing entry. The statistics recorded are the mean, maximum,
minimum and standard deviation
values (Eldén, 2019). The following figures give the descriptive statistics for each of the
remaining variables.
figure 3
figure 4
figure 5
figure 6
figure 7
Figure 7 shows the graphical distribution of every variable, both numeric and nominal.
figure 8
Figure 8 gives the scatter plot of the variables. Three variables are shown, and the clusters
that form are defined in terms of those variables.
a. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
vi. Models comparison
The screenshots above include summary statistics that report the correctly classified
instances. The more instances a model classifies correctly, the higher its percentage, and the
higher the percentage, the better the classification model. On the Iris dataset, in the order
the models were tested (Multilayer Perceptron, Naïve Bayes, J48, Random Forest, REP tree), the
percentages are 97%, 96%, 96%, 95% and 94%. By this measure the best model is the Multilayer
Perceptron, and the REP tree is the poorest performer of the five. The confusion matrices show
the same thing, since the better models have fewer incorrectly classified instances (Jabez,
Gowri, Vigneshwari, Mayan & Srinivasulu, 2019).
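The relationship between the confusion matrix, the percentage of correctly classified instances and the model ranking can be sketched in Python. This is an illustration only: the accuracy values are the percentages reported above, but the confusion matrix is a hypothetical 3-class example (it is not taken from the Weka output), and the function names are assumptions made for this sketch.

```python
# Percentages reported in the text for the Iris dataset, in test order.
iris_accuracy = {
    "Multilayer Perceptron": 97.0,
    "Naive Bayes": 96.0,
    "J48": 96.0,
    "Random Forest": 95.0,
    "REP tree": 94.0,
}


def accuracy_from_confusion(matrix):
    """Percentage of correctly classified instances: diagonal sum / total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


def rank_models(accuracy):
    """Model names sorted from highest to lowest accuracy."""
    return sorted(accuracy, key=accuracy.get, reverse=True)


# HYPOTHETICAL confusion matrix for a 150-instance, 3-class problem.
# Rows are actual classes, columns are predicted classes.
hypothetical_matrix = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 2, 48],
]

ranking = rank_models(iris_accuracy)
print("best:", ranking[0], "| poorest:", ranking[-1])
print("hypothetical matrix accuracy: %.2f%%"
      % accuracy_from_confusion(hypothetical_matrix))
```

Only the diagonal of the confusion matrix counts as correct, which is why a model with fewer off-diagonal entries scores a higher percentage.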
B. Breast Cancer
Here the descriptive statistics are examined in the same way as for the Iris dataset above.
Each figure gives the statistics for one variable, shown under the variable name. There are 10
variables in total.
figure 1
figure 2
figure 3
figure 4
figure 5
figure 6
figure 7
figure 8
figure 9
figure 10
figure 11
b. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
vi. Algorithms comparison
In the order the models were tested on the breast cancer dataset (Multilayer Perceptron, Naïve
Bayes, J48, Random Forest, REP tree), the percentages of correctly classified instances are
64.6853%, 71.6783%, 75.5245%, 69.5804% and 70.6294%. These results make the J48 decision tree
the best performer and the Multilayer Perceptron the poorest-performing model (Kumar, Koushik &
Deepak, 2018).
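Weka reports each percentage together with a count of correctly classified instances, so the two can be cross-checked. The sketch below assumes the dataset has 286 instances, which is the size of the breast-cancer file distributed with Weka. The text above does not state the instance count, so treat that number as an assumption; the helper name `correct_count` is also made up for this illustration.

```python
# ASSUMPTION: 286 instances (the instance count is not stated in the text).
N_INSTANCES = 286

# Percentages reported in the text, in test order.
breast_cancer_accuracy = {
    "Multilayer Perceptron": 64.6853,
    "Naive Bayes": 71.6783,
    "J48": 75.5245,
    "Random Forest": 69.5804,
    "REP tree": 70.6294,
}


def correct_count(percent, n_instances):
    """Number of correctly classified instances implied by a percentage."""
    return round(percent / 100.0 * n_instances)


counts = {
    model: correct_count(percent, N_INSTANCES)
    for model, percent in breast_cancer_accuracy.items()
}

for model, count in counts.items():
    recomputed = 100.0 * count / N_INSTANCES
    print(f"{model}: {count} of {N_INSTANCES} correct ({recomputed:.4f}%)")
```

If the assumed instance count is right, each recomputed percentage matches the reported one to four decimal places, which is a quick sanity check on figures transcribed from screenshots.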
C. Diabetes
We start with the descriptive statistics for each variable, shown in the figures below.
figure 1
figure 2
figure 3
figure 4
figure 5
figure 6
figure 7
figure 8
figure 9
figure 10
Figure 11 below gives the scatter plot, and figure 12 gives the histogram distributions.
figure 11
figure 12
c. Algorithms Testing
i. Multilayer Perceptron
ii. Naïve Bayes
iii. J48
iv. Random Forest
v. REP tree
Comparing the models, the best-performing model is Naïve Bayes and the poorest-performing
model is J48. In the order the models were tested (Multilayer Perceptron, Naïve Bayes, J48,
Random Forest, REP tree), the percentages of correctly classified instances are 75.3906%,
76.3021%, 73.8281%, 75.7813% and 75.2604% (Lang, Bravo-Marquez, Beckham, Hall & Frank, 2019).
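With all three datasets evaluated, the results can be summarized side by side. The sketch below uses only the percentages reported in this report, in the order the models were tested. The `summarize` helper and the "spread" figure (best minus poorest, in percentage points) are additions made for illustration and are not part of the original analysis.

```python
MODELS = ["Multilayer Perceptron", "Naive Bayes", "J48",
          "Random Forest", "REP tree"]

# Percentages of correctly classified instances as reported in the text.
REPORTED = {
    "Iris": [97.0, 96.0, 96.0, 95.0, 94.0],
    "Breast Cancer": [64.6853, 71.6783, 75.5245, 69.5804, 70.6294],
    "Diabetes": [75.3906, 76.3021, 73.8281, 75.7813, 75.2604],
}


def summarize(values):
    """Return (best model, poorest model, spread in percentage points)."""
    scores = dict(zip(MODELS, values))
    best = max(scores, key=scores.get)
    poorest = min(scores, key=scores.get)
    return best, poorest, scores[best] - scores[poorest]


for dataset, values in REPORTED.items():
    best, poorest, spread = summarize(values)
    print(f"{dataset}: best={best}, poorest={poorest}, "
          f"spread={spread:.2f} points")
```

No single model wins on every dataset, and the spread differs considerably between datasets, so the choice of model is best made per dataset rather than once for all three.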
Data Mining Applications (Task 2)
Data mining is the extraction of data and the use of that data to support decision making and
the day-to-day running of a company. More formally, data mining is the process of finding
anomalies, patterns and correlations in one or more large datasets. This is done to support
the classification and clustering of the data, and the resulting classes and clusters in turn
serve a range of purposes.
a. Purposes of Data Mining
The reason data mining is used by organizations across different industries is organizational
advancement. Data mining improves the relationship between an organization and its customers,
as well as among the customers themselves. It can also improve production volumes, which
raises revenue provided that everything produced is sold, and this in turn can raise profits
and advance the company (Malik, Abdallah, & Ala’raj, 2018).
b. Data Mining Stages
There are four stages of data mining: the data source stage, the data exploration and
preprocessing stage, the algorithm build-up stage and, finally, the deployment stage, which is
where the improvement in production is realized. Taking the stages in turn, the data source
stage is where data is extracted from all available sources, specifically the databases from
which data can be drawn. The second stage is the data preprocessing stage, where the data is
made fit for consumption. By consumption, we mean that the data is made better suited to
exploration in the analysis
stage. The third stage is where we build the machine learning algorithms that support the
company's production processes. Once the built model is good enough for production, it is
deployed (Souri & Hosseini, 2018).
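The four stages described above can be sketched as a small pipeline. Everything in this example is hypothetical: the records, the field names `usage` and `churned`, and the simple most-common-label model are stand-ins chosen to keep the sketch self-contained, not a description of how any particular company or tool implements these stages.

```python
from collections import Counter


def data_source():
    """Stage 1: extract raw records (a list stands in for a database)."""
    return [
        {"usage": "high", "churned": "yes"},
        {"usage": "low", "churned": "no"},
        {"usage": None, "churned": "no"},  # record with a missing value
        {"usage": "high", "churned": "yes"},
        {"usage": "low", "churned": "no"},
    ]


def preprocess(records):
    """Stage 2: make the data fit for analysis by dropping incomplete rows."""
    return [r for r in records if all(v is not None for v in r.values())]


def build_model(records, feature, target):
    """Stage 3: learn the most common target label for each feature value."""
    by_value = {}
    for record in records:
        by_value.setdefault(record[feature], Counter())[record[target]] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}


def deploy(model, default):
    """Stage 4: wrap the built model in a function that serves predictions."""
    def predict(value):
        return model.get(value, default)
    return predict


clean = preprocess(data_source())
predict = deploy(build_model(clean, "usage", "churned"), default="no")
print(len(clean), "clean records; prediction for 'high':", predict("high"))
```

Keeping each stage as its own function mirrors the staged description: the output of one stage is the only input the next stage needs.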
The techniques employed in data mining include classification, sequence analysis, clustering
and forecasting. Classification and clustering aid in predicting the patterns present in the
datasets being analyzed. Forecasting, on the other hand, comes once the classification and
clustering groups have been built: it is the prediction of what is expected to happen in the
future (Wang, Ji, Liu, Wang, Weng, Deng & Yuan, 2018).
c. Data Mining Application
Data mining is mainly used in business processes, for example in inventing a new product or in
retaining customers who want to leave a line of goods or services. Another use is increasing
the quantity of a good being produced.
The main focus here is the invention of a new product. Automobile companies examined the
extent to which the combustion of fossil fuels by car engines degraded the environment over
time, and found that burning fossil fuel does degrade the environment. The environment has
been affected over time through global warming and through polluted air that, when inhaled,
reduces people's life span. This prompted innovators to look, based on the data that had been
mined and analyzed, for more environmentally friendly means of car mobility. There followed
the development of
electric cars. This brought about a change and a real reduction in the pollution caused by
fossil fuels, since the cars run on green energy. As more of these cars are purchased, the
pollution of the environment is expected to fall, since an electric car does not burn fuel
that pollutes the environment (Yee, Sagadevan & Malim, 2018).
References
Eldén, L. (2019). Matrix methods in data mining and pattern recognition (Vol. 15). SIAM.
Jabez, J., Gowri, S., Vigneshwari, S., Mayan, J. A., & Srinivasulu, S. (2019). Anomaly Detection
by Using CFS Subset and Neural Network with WEKA Tools. In Information and
Communication Technology for Intelligent Systems (pp. 675-682). Springer, Singapore.
Kumar, M. N., Koushik, K. V. S., & Deepak, K. (2018). Prediction of Heart Diseases Using Data
Mining and Machine Learning Algorithms and Tools.
Lang, S., Bravo-Marquez, F., Beckham, C., Hall, M., & Frank, E. (2019). Wekadeeplearning4j:
A deep learning package for weka based on deeplearning4j. Knowledge-Based Systems,
178, 48-50.
Malik, M. M., Abdallah, S., & Ala’raj, M. (2018). Data mining and predictive analytics
applications for the delivery of healthcare services: a systematic literature review. Annals
of Operations Research, 270(1-2), 287-312.
Souri, A., & Hosseini, R. (2018). A state-of-the-art survey of malware detection approaches
using data mining techniques. Human-centric Computing and Information Sciences, 8(1),
3.
Wang, R., Ji, W., Liu, M., Wang, X., Weng, J., Deng, S., ... & Yuan, C. A. (2018). Review on
mining data from multiple data sources. Pattern Recognition Letters, 109, 120-128.
Yee, O. S., Sagadevan, S., & Malim, N. H. A. H. (2018). Credit card fraud detection using
machine learning as a data mining technique. Journal of Telecommunication, Electronic
and Computer Engineering (JTEC), 10(1-4), 23-27.