DATA 25: Comparative Analysis of Classification Models in Data Mining

Added on 2022/09/28

AI Summary

This report, created for DATA 25, provides a comparative analysis of five classification algorithms: Multilayer Perceptron, Naive Bayes, J48, Random Forest, and REPtree. The algorithms were tested on three datasets: Iris, Breast Cancer, and Diabetes, using the Weka data mining tool. The report includes descriptive statistics, graphical representations, and performance evaluations based on 10-fold cross-validation, focusing on classification accuracy and confusion matrices. Task 2 explores data mining applications, discussing its role in business decision-making, customer relationships, and product development, including a case study on the development of electric cars. The report also outlines the stages of data mining, from data source to deployment, and highlights key parameters like classification, sequence analysis, clustering, and forecasting. References to relevant literature are also included.

Running Head: DATA 1
Data Mining
Name of Student
Date

Classification Performance Evaluation
This section compares the performance of five classification models on three datasets. All five
models are run on each dataset, and their performance is then compared dataset by dataset.
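The report's summary states that the models are evaluated with 10-fold cross-validation in Weka. As a rough illustration of that procedure, the sketch below splits the data into k folds, trains on k-1 of them, tests on the remaining one, and pools the correct predictions. It is a minimal stand-in written in plain Python, not the Weka implementation: the function names, the majority-class baseline, and the toy data are all assumptions made for the example, and Weka's own procedure may differ in detail (for instance, it can stratify folds by class).

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=1):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_fn, rows, labels, k=10, seed=1):
    """Return the fraction of instances classified correctly across k folds.

    train_fn(train_rows, train_labels) must return a predict(row) function.
    """
    folds = k_fold_indices(len(rows), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        predict = train_fn([rows[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(predict(rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


def train_majority(train_rows, train_labels):
    """Hypothetical baseline: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda row: majority


# Toy data (made up for illustration, not taken from the report).
toy_rows = [[float(i)] for i in range(20)]
toy_labels = ["a"] * 15 + ["b"] * 5
baseline_accuracy = cross_validate(train_majority, toy_rows, toy_labels, k=10)
```

Any of the five models could in principle be plugged in through `train_fn`; the majority-class baseline is used here only because it needs no external library.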
A. Iris Dataset
We begin with the structure of the dataset, followed by the graphs and the descriptive statistics
for each variable.
figure 1
Figure 1 shows that the dataset contains 150 instances and a total of 5 variables.
figure 2
Figure 2 shows the variable sepal length, which has 35 distinct values. The variable type is
numeric and no cell has a missing entry. The statistics recorded are the mean, maximum, minimum,
and standard deviation (Eldén, 2019). The following figures give the same descriptive statistics
for each of the remaining variables.
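The per-variable summary described above (missing entries, distinct values, minimum, maximum, mean, and standard deviation) can be sketched in a few lines of Python. This is only an illustration of what those statistics mean: the `describe` function and the sample values are assumptions made for the example, and the choice of the sample (n-1) standard deviation is also an assumption rather than something the report states.

```python
import statistics


def describe(values):
    """Summarise one numeric variable; None marks a missing entry."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "minimum": min(present),
        "maximum": max(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }


# Made-up sample values, not the actual Iris measurements.
sample = [5.1, 4.9, 4.7, 5.1, None, 5.4]
summary = describe(sample)
```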
figure 3
figure 4
figure 5
figure 6

figure 7
Figure 7 shows the graphical distribution of each variable, both numeric and nominal.

figure 8
Figure 8 gives the scatter plots for the variables. There are a total of three variables, and the
clusters that form are expressed in terms of those variables.

a. Algorithms Testing
i. Multilayer Perceptron
ii. Naive Bayes

iii. J48
iv. Random Forest
v. REPTree

vi. Model comparison
The summary statistics in the screenshots above report the number of correctly classified
instances for each model. The more instances a model classifies correctly, the higher its
accuracy and the better the classification model. On the Iris dataset, taking the models in the
order they were tested, the accuracies are 97%, 96%, 96%, 95% and 94%. The best model is therefore
the Multilayer Perceptron, and REPTree is the poorest performer of the five. The confusion
matrices show the same pattern, since the better models have fewer incorrectly classified
instances (Jabez, Gowri, Vigneshwari, Mayan & Srinivasulu, 2019).
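The link between a confusion matrix and the percentage of correctly classified instances can be made concrete with a short Python sketch: the correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total count. The example matrix below is invented for illustration and is not one of the report's matrices. The mapping of the quoted percentages to model names assumes they follow the order in which the models are listed.

```python
def accuracy_percent(confusion):
    """Percentage of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Hypothetical 3-class matrix (rows = actual, columns = predicted);
# the counts are invented and are not one of the report's matrices.
example_matrix = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 2, 48],
]
example_accuracy = accuracy_percent(example_matrix)

# Rounded accuracies as quoted in the text, assuming they follow the
# order in which the models are listed.
reported = {
    "Multilayer Perceptron": 97,
    "Naive Bayes": 96,
    "J48": 96,
    "Random Forest": 95,
    "REPTree": 94,
}
ranking = sorted(reported, key=reported.get, reverse=True)
```

With only one percentage point between neighbouring models, the ranking on a 150-instance dataset rests on one or two instances, so it is best read as a rough ordering.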

B. Breast Cancer
As with the Iris dataset above, we first look at the descriptive statistics. Each figure gives the
statistics for one variable, shown under the variable name. There is a total of 10 variables.
figure 1
figure 2
figure 3