Data Mining Assignment 1 Report: Classification, Clustering, and More
Summary
This report presents the analysis of a data mining assignment, encompassing classification, numeric prediction, clustering, and association finding techniques. The report explores the performance of various classifiers such as Zero R, One R, J48, and IBK, evaluating their accuracy and predictive capabilities. It investigates the impact of parameter settings on classifier performance, including the C and M values for J48 and the K value for IBK. Furthermore, the report delves into numeric prediction using classifiers such as Zero R, M5P, and IBK, comparing their correlation coefficients and mean absolute errors. Clustering is explored using the K-Means and EM algorithms, examining the effects of different K values and seeds. Finally, the report examines association finding using the Apriori algorithm on different data sets. The report also includes an analysis of attribute selection algorithms and identifies the best performing algorithms for each task.

University
Semester
Data Mining
Student ID
Student Name
Submission Date

Table of Contents
Part 1 – Classification
1. Task 1 – Classifier
2. Task 2 - J48 Classifier
3. Task 3 - Reset J48 parameters
4. Task 4 - IBK Classifier
5. Task 5 - Predictive Accuracy
6. Task 6 – Accuracy
7. Task 7 - Golden Nuggets
8. Task 8 – Attribute Selection Algorithm
Part 2 - Numeric Prediction
1. Task 1 – Classifiers
2. Task 2 - Explore Different Parameters Settings
3. Task 3 - Investigation
4. Golden Nuggets
Part 3 – Clustering
1. Task 1 - K Means Clustering
2. Task 2 - Effects of Seeds
3. Task 3 - EM Algorithm
4. Task 4 - Normalize Filter
5. Task 5 - Values Changes
6. Task 6 – Clusters
7. Task 7 - Compare K Means and EM clustering
8. Task 8 - Golden Nuggets
Part 4 - Association Finding
1. Task 1 – Representation
2. Task 2 - Apriori Algorithm - Groceries Data set 1
3. Task 3 - Explore Different Possibilities
4. Task 4 - Apriori Algorithm - Groceries Data set 2
5. Task 5 - Explore Different Possibilities
6. Task 6 - Other Associators
7. Task 7 - Golden Nuggets
References
Part 1 – Classification
1. Task 1 – Classifier
The training and cross-validation error tables for the Zero R, One R, J48 and IBK classifiers are shown below.
Zero R
Correctly Classified Instances 700 70 %
Incorrectly Classified Instances 300 30 %
Kappa statistic 0
Mean absolute error 0.4202
Root mean squared error 0.4583
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 1000
One R
Correctly Classified Instances 743 74.3 %
Incorrectly Classified Instances 257 25.7 %
Kappa statistic 0.3009
Mean absolute error 0.257
Root mean squared error 0.507
Relative absolute error 61.1672 %
Root relative squared error 110.6259 %
Total Number of Instances 1000
J48
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000
IBK
Correctly Classified Instances 1000 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.001
Root mean squared error 0.001
Relative absolute error 0.2375 %
Root relative squared error 0.2178 %
Total Number of Instances 1000
Based on the training and cross-validation errors for the Zero R, One R, J48 and IBK classifiers, the IBK classifier shows the lowest errors and classifies 100% of the instances correctly on the training set. On that basis it is the best of the four classifiers (Azzalini and Scarpa, 2012).
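As an illustration, this comparison can be reproduced through the Weka Java API. The sketch below is a minimal example, not the exact procedure used in the report: it assumes Weka 3.8 on the classpath and the data in a file named credit-g.arff (an assumed file name), and runs 10-fold cross-validation for each of the four classifiers.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Load the data set; the class is assumed to be the last attribute.
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new ZeroR(), new OneR(), new J48(), new IBk() };
        for (Classifier model : models) {
            // 10-fold cross-validation with a fixed seed for repeatability.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.printf("  Correctly classified: %.1f %%%n", eval.pctCorrect());
            System.out.printf("  Mean absolute error:  %.4f%n", eval.meanAbsoluteError());
        }
    }
}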
2. Task 2 - J48 Classifier
The determined C and M values are presented below.
● C value – 0.25 (the pruning confidence factor)
● M value – 2 (the minimum number of instances per leaf)
These settings control how aggressively the tree is pruned and thereby limit overfitting. The results are presented below, followed by a short sketch of how the parameters are set programmatically.
=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.956 0.380 0.854 0.956 0.902 0.640 0.857 0.905 good
0.620 0.044 0.857 0.620 0.720 0.640 0.857 0.783 bad
Weighted Avg. 0.855 0.279 0.855 0.855 0.847 0.640 0.857 0.869
=== Confusion Matrix ===
a b <-- classified as
669 31 | a = good
114 186 | b = bad
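For reference, a minimal sketch of how these two parameters can be set on a J48 instance through the Weka Java API; it is equivalent to the command-line options -C 0.25 -M 2.

import java.util.Arrays;
import weka.classifiers.trees.J48;

public class ConfigureJ48 {
    public static void main(String[] args) throws Exception {
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // C: pruning confidence factor
        tree.setMinNumObj(2);            // M: minimum instances per leaf
        // Print the resulting option string to confirm the settings.
        System.out.println(Arrays.toString(tree.getOptions()));
    }
}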
3. Task 3 - Reset J48 parameters
Here, we reset the parameters to:
● C value - 0.25
● M value – 10
Increasing M to 10 prunes the tree more aggressively, which further reduces overfitting at some cost in training accuracy. The results are presented below (Chattamvelli, 2016).
=== Summary ===
Correctly Classified Instances 805 80.5 %
Incorrectly Classified Instances 195 19.5 %
Kappa statistic 0.5091
Mean absolute error 0.2919
Root mean squared error 0.382
Relative absolute error 69.4623 %
Root relative squared error 83.3599 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.901 0.420 0.834 0.901 0.866 0.514 0.804 0.871 good
0.580 0.099 0.716 0.580 0.641 0.514 0.804 0.676 bad
Weighted Avg. 0.805 0.324 0.798 0.805 0.799 0.514 0.804 0.812
=== Confusion Matrix ===
a b <-- classified as
631 69 | a = good
126 174 | b = bad
4. Task 4 - IBK Classifier
The selected value is k = 1. The results are shown below, followed by a short reproduction sketch; note that when a 1-nearest-neighbour classifier is evaluated on its own training set, every instance matches itself, which accounts for the 100% accuracy.
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.19 seconds
=== Summary ===
Correctly Classified Instances 1000 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.001
Root mean squared error 0.001
Relative absolute error 0.2375 %
Root relative squared error 0.2178 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 good
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 bad
Weighted Avg. 1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000
=== Confusion Matrix ===
a b <-- classified as
700 0 | a = good
0 300 | b = bad
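The 100% figure can be reproduced with a short sketch (assuming the same credit-g.arff file name as above) that builds IBk with k = 1 and evaluates it on the training set itself:

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IBkOnTrainingSet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk(1);          // k = 1, i.e. IB1 behaviour
        knn.buildClassifier(data);

        // Evaluating on the training data itself: every instance finds
        // itself as its nearest neighbour, hence 100% accuracy.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(knn, data);
        System.out.println(eval.toSummaryString());
    }
}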
5. Task 5 - Predictive Accuracy
The predictive accuracy of each classifier is shown below (Michalski, 2014).
Zero R
Correctly Classified Instances 700 70 %
Incorrectly Classified Instances 300 30 %
Total Number of Instances 1000
One R
Correctly Classified Instances 743 74.3 %
Incorrectly Classified Instances 257 25.7 %
Total Number of Instances 1000
J48
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Total Number of Instances 1000
IBK
Correctly Classified Instances 1000 100 %
Incorrectly Classified Instances 0 0 %
Total Number of Instances 1000
The best predictive accuracy is 100%, achieved by the IBK classifier, and the worst is 70%, from Zero R.
6. Task 6 – Accuracy
Zero R
Correctly Classified Instances 700 70 %
Incorrectly Classified Instances 300 30 %
Accuracy is 70%
One R
Correctly Classified Instances 743 74.3 %
Incorrectly Classified Instances 257 25.7 %
Accuracy is 74.3%
J48
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Accuracy is 85.5%
IBK
Correctly Classified Instances 1000 100 %
Incorrectly Classified Instances 0 0 %
Accuracy is 100%
The best accuracy is 100% (IBK) and the worst is 70% (Zero R).
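Accuracy here is computed as the number of correctly classified instances divided by the total number of instances; for example, J48 classifies 855 of 1000 instances correctly, giving 855 / 1000 = 85.5%.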
7. Task 7 - Golden Nuggets
Data mining is the task of finding useful patterns in data, here through classification techniques. Occasionally a pattern is so unexpected and valuable that it is called a golden nugget. However, no golden nugget was found in the provided data (Mitchell, 2017).
8. Task 8 – Attribute Selection Algorithm
The selected attributes are shown below.
=== Attribute Selection on all input data ===
Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 134
The merit of the best subset found: 0.076
Attribute Subset Evaluator (supervised, Class (nominal): 21 class):
CFS Subset Evaluator
Including locally predictive attributes
Selected attributes: 1,2,3 : 3
checking_status
duration
credit_history
After reducing the data set to these three attributes, the IBK classifier achieves the accuracy shown below.
Correctly Classified Instances 574 57.4 %
Incorrectly Classified Instances 426 42.6 %
Kappa statistic 0.1515
Mean absolute error 0.2179
Root mean squared error 0.33
Relative absolute error 87.5119 %
Root relative squared error 93.5983 %
Total Number of Instances 1000
The accuracy is 57.4%, which is lower than the accuracy obtained with the full set of input attributes.
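A minimal sketch of this selection step through the Weka Java API (again assuming the credit-g.arff file name): it applies the CFS subset evaluator with a forward best-first search, as in the Explorer output above, and then reduces the data to the selected attributes.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst()); // forward search is the default
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
        // Keep only the selected attributes (plus the class attribute).
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Attributes after reduction: " + reduced.numAttributes());
    }
}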
Part 2 - Numeric Prediction
1. Task 1 – Classifiers
The training and cross-validation error tables for the Zero R, M5P and IBK classifiers are shown below, followed by a short reproduction sketch.
Zero R
Correlation coefficient 0
Mean absolute error 39.3139
Root mean squared error 51.6914
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 303
M5P
Correlation coefficient 0.4033
Mean absolute error 36.7682
Root mean squared error 47.3153
Relative absolute error 93.5248 %
Root relative squared error 91.5342 %
Total Number of Instances 303
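As with Part 1, these figures can be reproduced with a small sketch; the file name numeric-data.arff is a placeholder for the 303-instance numeric-class data set used here. For a numeric class, Evaluation reports the correlation coefficient instead of accuracy.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NumericPrediction {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("numeric-data.arff"); // placeholder name
        data.setClassIndex(data.numAttributes() - 1);          // numeric class attribute

        Classifier[] models = { new ZeroR(), new M5P(), new IBk() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.printf("  Correlation coefficient: %.4f%n", eval.correlationCoefficient());
            System.out.printf("  Mean absolute error:     %.4f%n", eval.meanAbsoluteError());
        }
    }
}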