Data Mining and Visualization for Business Intelligence Project

Added on 2019/11/26

AI Summary

This data mining project analyzes a dataset using Principal Component Analysis (PCA) for dimension reduction and a Naive Bayes Classifier. The PCA section utilizes Excel and XLMiner to identify key variables, determine the need for data normalization, and discusses the advantages and disadvantages of the method. The Naive Bayes Classifier section focuses on a customer dataset, partitioning it for training and validation, and calculating probabilities using pivot tables to determine the best strategy for a bank customer to obtain a personal loan based on credit card usage and online banking services. The project includes references to relevant literature on business intelligence and data mining techniques.

Data Mining and Visualization for Business Intelligence
Assignment

Question 1
Dimension Reduction
Given data
(a) Principal Component Analysis (PCA) for the given data set was carried out in Excel
using XLMiner, and the final output is shown below:
Based on the variance analysis, nearly 95% of the total variance is captured by the top six of
the eight principal components, which are highlighted.
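The cut-off rule behind this choice can be sketched as a cumulative sum over the variance shares of the components. The sketch below is only illustrative: the function name `components_needed` and the eight variance shares are hypothetical stand-ins, not the XLMiner output, and the 95% threshold is passed as a parameter.

```python
from itertools import accumulate


def components_needed(variances, threshold=0.95):
    """Return the smallest k such that the k largest components
    explain at least `threshold` of the total variance."""
    total = sum(variances)
    ordered = sorted(variances, reverse=True)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical variance shares for eight components (NOT the XLMiner output).
example_shares = [0.27, 0.20, 0.17, 0.13, 0.11, 0.08, 0.03, 0.01]
k = components_needed(example_shares)
print(k)  # 6
```

With these made-up shares the first five components reach 88% and the first six reach 96%, so six components are kept.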
Therefore, the component matrix above is reduced to contain only the top six principal
components. The reduced matrix is shown below:
The highlighted table below shows the principal components and their statistically significant
features, derived from the table highlighted above by taking the largest value in each of the
columns.
Further, the variance table indicates that the most pivotal features are X2, X6 and X7.
Data normalization is needed when one variable shows significantly higher variation than the
other variables.
The variance table above shows that the largest share of variance is about 27.16%, for the
first principal component. This share is not significantly higher than those of the other
components, so it is fair to conclude that the total variance arises from several components
rather than from a single one. Hence, data normalization is not needed in this case
(Grossmann & MA, 2015).
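For reference, normalization before PCA usually means rescaling each variable to zero mean and unit variance, so that no variable dominates merely because of its units. The following is a minimal sketch of that rescaling; the columns `income` and `age` and their values are invented for illustration and are not taken from the assignment's data set.

```python
from statistics import mean, pstdev, pvariance


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(x - mu) / sigma for x in column]


# Hypothetical columns on very different scales (illustrative values only).
income = [42000.0, 58000.0, 61000.0, 75000.0, 90000.0]
age = [23.0, 35.0, 41.0, 52.0, 60.0]

for name, column in (("income", income), ("age", age)):
    z = standardize(column)
    # Variance before and after rescaling; the second value is 1 up to rounding.
    print(name, round(pvariance(column), 1), "->", round(pvariance(z), 3))
```

Before rescaling, the variance of the hypothetical `income` column is several orders of magnitude larger than that of `age`; after rescaling both are equal to one.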
(b) The main advantages and disadvantages of using principal component analysis are
furnished below:
Advantages of PCA
• This method is used to simplify a complex data set.
• Selecting the components that capture the most variance gives the leeway to pick the pivotal
dimensions out of the many dimensions of the variables, which eases the analysis.
• In PCA, the components are represented in “Orthogonal Form”, which makes the result easy to
simplify and interpret (Hofmann & Chisholm, 2016).
Disadvantages of PCA
• PCA is appropriate only when the variables have a linear relationship and orthogonal
projections are suitable. If the variables show a non-linear relationship, this method should
not be used and other methods should be considered instead.
• Identifying the variable with the highest variance among the other variables can be a
problem, especially when blind source separation is involved.
• PCA is suited only to a “Gaussian distribution”, where the mean and variance describe the
distribution; it cannot be used for other statistical distributions where the mean and
variance are not critical (Shmueli et al., 2016).
Question 2
Naïve Bayes Classifier
Total number of customers (data given) = 5000 customers
Partition of the data - Training: Validation = 60%: 40%
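A 60%:40% partition of 5000 customers gives 3000 training rows and 2000 validation rows. The sketch below shows one way such a random split could be made; the function name `partition_indices` and the seed are assumptions, since the assignment does not show how XLMiner performs the partition.

```python
import random


def partition_indices(n_rows, train_fraction=0.6, seed=12345):
    """Shuffle row indices and split them into training and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed value)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, valid_idx = partition_indices(5000)
print(len(train_idx), len(valid_idx))  # 3000 2000
```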
The data is attached in the Excel spreadsheet (Hofmann & Chisholm, 2016).
(a) Pivot table
Column variable - Online
Row variable - CC (Credit card)
Secondary row variable - Loan (Personal loan)
(b) “Probability that a randomly selected customer who has both the bank’s credit card and
online banking service would be ready to accept the personal loan”
Number of customers with CC and online = 542
Number of customers with CC, loan and online = 56
$$\text{Probability} = \frac{\text{Number of customers with CC, loan and online}}{\text{Number of customers with CC and online}} = \frac{56}{542} = 0.103$$
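The same calculation can be written as a lookup in the pivot table cells. In this sketch the cell counts 56 and 542 come from the text, while the helper names `pivot_counts` and `p_loan_given_cc_online` and the record layout (dictionaries with `cc`, `online` and `loan` keys) are assumptions made for illustration.

```python
from collections import Counter


def pivot_counts(records):
    """Count customers for every (cc, online, loan) combination,
    mirroring the cells of the three-way pivot table."""
    return Counter((r["cc"], r["online"], r["loan"]) for r in records)


def p_loan_given_cc_online(counts):
    """Exact P(Loan=1 | CC=1, Online=1) from pivot table cell counts."""
    accepted = counts[(1, 1, 1)]
    total = accepted + counts[(1, 1, 0)]
    return accepted / total


# Cell counts from the text: 542 customers with CC and online, 56 of them with a loan.
counts = Counter({(1, 1, 1): 56, (1, 1, 0): 542 - 56})
print(round(p_loan_given_cc_online(counts), 3))  # 0.103
```

Given the actual training rows, `pivot_counts` would produce the same kind of `Counter` directly from the data.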
(c) Pivot tables and probability
First pivot table:
Column variable - Online
Row variable – Loan (Personal loan)
Second pivot table:
Column variable – CC (Credit card)
Row variable – Loan (Personal loan)
• Proportion or probability is computed below based on the above highlighted pivot tables.
S. No.   Proportion or probability
(i)      $P(\text{CC}=1 \mid \text{Loan}=1) = \frac{97}{304} = 0.319$
(ii)     $P(\text{Online}=1 \mid \text{Loan}=1) = \frac{179}{304} = 0.588$
(iii)    $P(\text{Loan}=1) = \frac{304}{3000} = 0.101$
(iv)     $P(\text{CC}=1 \mid \text{Loan}=0) = \frac{814}{2696} = 0.301$
(v)      $P(\text{Online}=1 \mid \text{Loan}=0) = \frac{1597}{2696} = 0.592$
(vi)     $P(\text{Loan}=0) = \frac{2696}{3000} = 0.898$
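These six proportions can be recomputed directly from the counts quoted in the text. The sketch below uses only those counts; the variable names are illustrative.

```python
# Counts from the training partition, as reported in the text.
n_total = 3000
n_loan, n_no_loan = 304, 2696
cc_and_loan, online_and_loan = 97, 179
cc_and_no_loan, online_and_no_loan = 814, 1597

proportions = {
    "P(CC=1|Loan=1)": cc_and_loan / n_loan,
    "P(Online=1|Loan=1)": online_and_loan / n_loan,
    "P(Loan=1)": n_loan / n_total,
    "P(CC=1|Loan=0)": cc_and_no_loan / n_no_loan,
    "P(Online=1|Loan=0)": online_and_no_loan / n_no_loan,
    "P(Loan=0)": n_no_loan / n_total,
}

for label, value in proportions.items():
    print(f"{label} = {value:.4f}")
```

Printing four decimals shows that some of the three-decimal values in the text are truncated rather than rounded, for example 179/304 ≈ 0.5888 and 2696/3000 ≈ 0.8987.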
(d) Based on the quantities determined in part (c), the Naïve Bayes probability is given
below:
$$P(\text{Loan}=1 \mid \text{CC}=1, \text{Online}=1) = \frac{0.319 \times 0.588 \times 0.101}{0.319 \times 0.588 \times 0.101 + (0.301 \times 0.592 \times 0.898)} = \frac{0.01894}{0.01894 + 0.1600} \approx 0.106$$
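The Naïve Bayes formula can be checked numerically with the three-decimal inputs from part (c). This is a minimal sketch; the function name `naive_bayes_posterior` and its argument layout are assumptions, not part of XLMiner.

```python
def naive_bayes_posterior(likelihoods_pos, prior_pos, likelihoods_neg, prior_neg):
    """P(class = 1 | features) for two classes, assuming the features are
    conditionally independent given the class (the naive assumption)."""
    score_pos = prior_pos
    for p in likelihoods_pos:
        score_pos *= p
    score_neg = prior_neg
    for p in likelihoods_neg:
        score_neg *= p
    return score_pos / (score_pos + score_neg)


# Inputs are the three-decimal values from part (c).
posterior = naive_bayes_posterior(
    likelihoods_pos=[0.319, 0.588],  # P(CC=1|Loan=1), P(Online=1|Loan=1)
    prior_pos=0.101,                 # P(Loan=1)
    likelihoods_neg=[0.301, 0.592],  # P(CC=1|Loan=0), P(Online=1|Loan=0)
    prior_neg=0.898,                 # P(Loan=0)
)
print(round(posterior, 3))  # 0.106
```

The result is close to the exact pivot table value of 56/542 ≈ 0.103 from part (b), which is what one would expect if the independence assumption holds approximately for these two variables.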
(e) The best strategy for a bank customer seeking a personal loan is to have no credit card
and not to use the online services offered by the bank, because this is the combination that
tends to maximize the probability of obtaining a loan.
References
Grossmann, W., & MA, R.S. (2015). Fundamentals of Business Intelligence (5th ed.). New York: Springer.
Hofmann, M., & Chisholm, A. (2016). Text Mining and Visualization: Case Studies Using Open-Source Tools (3rd ed.). Florida: CRC Press.
Shmueli, G., Bruce, C.P., Stephens, L.M., & Patel, R.N. (2016). Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (2nd ed.). Sydney: John Wiley & Sons.