Data Mining Assignment


Added on 2019/11/12

AI Summary

This data mining assignment focuses on two key areas: Principal Component Analysis (PCA) and the Naive Bayes Classifier. The PCA section involves evaluating and interpreting the results of a PCA analysis performed using XLMiner, identifying critical features, and discussing the advantages and disadvantages of the PCA method. The Naive Bayes section uses a dataset of Universal Bank customers to calculate probabilities related to loan acceptance based on factors like credit card ownership and online banking usage. The assignment requires creating pivot tables, calculating probabilities, and interpreting the results to determine the likelihood of loan acceptance under different conditions. The student is asked to analyze the data and draw conclusions about the factors influencing loan acceptance.



Data Mining
Question 1
Dimension Reduction
Part (A)
• Principal Component Analysis (PCA)
Number of variables = 8
XLMiner output is shown below:
➢ Evaluation and interpretation of the PCA
Based on the variance analysis in the XLMiner output, the first six principal components explain
95% of the variance. The principal component matrix can therefore be reduced to these six
components.
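The 95% cut-off rule can be illustrated with a short Python sketch. The variance percentages below are hypothetical placeholders chosen only for illustration; they are not the XLMiner output, and the helper name `components_needed` is an assumption of this sketch.

```python
from itertools import accumulate


def components_needed(variance_pct, threshold=95.0):
    """Return how many leading components reach the cumulative threshold (in %)."""
    # Principal components are ranked by explained variance, largest first.
    ordered = sorted(variance_pct, reverse=True)
    for count, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative >= threshold:
            return count
    # Threshold never reached: every component is needed.
    return len(ordered)


# Hypothetical variance percentages for 8 components (NOT the XLMiner output).
example_pct = [30.0, 22.0, 17.0, 12.0, 9.0, 6.0, 3.0, 1.0]
n_components = components_needed(example_pct)
print(n_components)
```

With real output, the list would be replaced by the "Variance %" row reported by the tool.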
➢ The six major principal components and their respective significant features are furnished
below:
The most critical features among the six components are identified from their contribution to
the variance. The critical features are x2, x6 and x7.
When one variable shows a much higher variance than the other variables, it is critical to
normalize the data.
The level of variance is assessed from the variance analysis output of XLMiner.
From the above variance analysis, the first principal component shows the maximum variance, but
it accounts for only 27.16% of the total variance. The remaining variance therefore comes from
the other components, so it is fair to conclude that no single variable dominates and data
normalization is not needed in this case.
Part (B)
Advantages of Principal Component Analysis
• It reduces a complex data set with many dimensions to a simpler data set with a specified
number of dimensions.
• It also reduces the risk of over-fitting the data set.
• It reduces the complexity of the analysis even when the number of variables is high.
• PCA produces orthogonal (uncorrelated) components, which makes the results easier to interpret.
Disadvantages of Principal Component Analysis
• It can be used only for variables that have linear associations and orthogonal projections.
• The method is appropriate only when the data follow a Gaussian distribution.
• Selecting the variables with maximum variation can be a problem in some cases. This is more
frequent when the data come from blind sources.
Question 2
Naïve Bayes Classifier
Number of customers of Universal Bank = 5000
Customers who accepted the loan = 9.6% of 5000 = 480
The data are partitioned in the following proportions:
Training = 60%
Validation = 40%
Part (A)
The pivot table for the training data was prepared in an Excel spreadsheet after running XLMiner
and is shown below:
Column variable – Online
Primary row variable – Credit card
Secondary row variable – Personal loan
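As a rough illustration of the pivot layout described above (Online as the column variable, Credit card and Personal loan as nested row variables), the counts can be tabulated with the Python standard library. The `records` list is a hypothetical toy sample, not the Universal Bank data, and `pivot_counts` is a name introduced only for this sketch.

```python
from collections import Counter

# Hypothetical toy records: (credit_card, personal_loan, online), each 0 or 1.
# These rows are illustrative only and are NOT taken from the bank data set.
records = [
    (1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 0, 1),
    (1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 0, 1),
]


def pivot_counts(rows):
    """Count rows per (credit_card, personal_loan) row key and online column key."""
    return Counter(((cc, loan), online) for cc, loan, online in rows)


table = pivot_counts(records)

# Print the table with CC and Loan as nested rows and Online as columns.
print("CC Loan | Online=0 Online=1")
for cc in (0, 1):
    for loan in (0, 1):
        row = (cc, loan)
        print(f"{cc:>2} {loan:>4} | {table[(row, 0)]:>8} {table[(row, 1)]:>8}")
```

A missing combination simply counts as zero, because `Counter` returns 0 for keys it has not seen.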
Part (B)
Probability that a customer who owns a bank credit card and actively uses the online banking
service would accept the loan from Universal Bank = ?
Total count of customers who own a credit card and use the online banking service = 538
Total count of customers who own a credit card, use the online banking service and accept the
loan = 53
The required probability is determined as given below:
P = 53 / 538 = 0.0985
There is only a 9.85% probability that a customer who owns a credit card and uses the online
banking service would take the loan.
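The same ratio can be checked with a minimal Python sketch that uses the two pivot-table counts quoted above. The function name `conditional_probability` is an assumption made for illustration.

```python
def conditional_probability(joint_count, condition_count):
    """Estimate P(event | condition) as joint_count / condition_count."""
    if condition_count <= 0:
        raise ValueError("condition_count must be positive")
    if not 0 <= joint_count <= condition_count:
        raise ValueError("joint_count must lie between 0 and condition_count")
    return joint_count / condition_count


# Counts quoted in Part (B): 53 loan acceptors among the 538 customers
# who hold a credit card and use online banking.
p_loan_given_cc_online = conditional_probability(53, 538)
print(f"{p_loan_given_cc_online:.4f}")
```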
Part (C)
Pivot tables for the training data are shown below:
➢ Column variable – Online and Row variable – Personal loan
➢ Column variable – Credit card and Row variable – Personal loan
➢ Computation of the quantities P(A|B) is shown below:
(i) P(CC=1 | Loan=1)
Favorable case = 89
Total count = 296
Proportion of the credit card holders among the loan acceptors = 89 / 296 ≈ 0.3006
(ii) P(Online=1 | Loan=1)
Favorable case = 170
Total count = 296
Probability = 170 / 296 ≈ 0.574
(iii) P(Loan=1)
Favorable case = 296
Total count = 3000
Proportion of loan acceptors among total customers = 296 / 3000 ≈ 0.098
(iv) P(CC=1 | Loan=0)
Favorable case = 798
Total count = 2704
Probability = 798 / 2704 ≈ 0.295
(v) P(Online=1 | Loan=0)
Favorable case = 1644
Total count = 2704
Probability = 1644 / 2704 ≈ 0.607
(vi) P(Loan=0)
Favorable case = 2704
Total count = 3000
Probability = 2704 / 3000 ≈ 0.901
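The six quantities can also be computed in one pass from the favorable and total counts listed above. The following Python sketch assumes only those counts; the dictionary keys are labels chosen for this illustration.

```python
# (favorable count, total count) pairs taken from the pivot-table figures above.
counts = {
    "P(CC=1|Loan=1)": (89, 296),
    "P(Online=1|Loan=1)": (170, 296),
    "P(Loan=1)": (296, 3000),
    "P(CC=1|Loan=0)": (798, 2704),
    "P(Online=1|Loan=0)": (1644, 2704),
    "P(Loan=0)": (2704, 3000),
}

# Each probability is the favorable count divided by the total count.
probabilities = {name: fav / total for name, (fav, total) in counts.items()}

for name, value in probabilities.items():
    print(f"{name} = {value:.4f}")
```

Keeping the unrounded ratios in `probabilities` avoids carrying rounding differences into later calculations.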
Part (D)
Naïve Bayes Probability
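P(Loan=1 | CC=1, Online=1)
Using the unrounded ratios from Part (C):
Numerator = P(CC=1 | Loan=1) × P(Online=1 | Loan=1) × P(Loan=1)
= (89/296) × (170/296) × (296/3000) ≈ 0.01704
Denominator = Numerator + P(CC=1 | Loan=0) × P(Online=1 | Loan=0) × P(Loan=0)
= 0.01704 + (798/2704) × (1644/2704) × (2704/3000) ≈ 0.01704 + 0.16172 = 0.17876
Naïve Bayes Probability = 0.01704 / 0.17876 ≈ 0.0953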
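A minimal Python sketch of the same naïve Bayes calculation follows, working directly from the pivot-table counts of Part (C) so that no intermediate ratio is rounded. The function name `naive_bayes_posterior` and its parameters are assumptions made for illustration; the result can shift in the third or fourth decimal place depending on how the intermediate ratios are rounded.

```python
from math import prod


def naive_bayes_posterior(prior_pos, likelihoods_pos, prior_neg, likelihoods_neg):
    """Return P(class=1 | features), assuming the features are conditionally
    independent given the class (the naive Bayes assumption)."""
    score_pos = prior_pos * prod(likelihoods_pos)
    score_neg = prior_neg * prod(likelihoods_neg)
    return score_pos / (score_pos + score_neg)


# Ratios built from the Part (C) counts, left unrounded.
posterior = naive_bayes_posterior(
    prior_pos=296 / 3000,                       # P(Loan=1)
    likelihoods_pos=[89 / 296, 170 / 296],      # P(CC=1|Loan=1), P(Online=1|Loan=1)
    prior_neg=2704 / 3000,                      # P(Loan=0)
    likelihoods_neg=[798 / 2704, 1644 / 2704],  # P(CC=1|Loan=0), P(Online=1|Loan=0)
)
print(f"{posterior:.4f}")
```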
Part (E)