Data Mining and Visualization for Business Intelligence Analysis

Added on 2020/04/01

AI Summary

This assignment delves into data mining and visualization techniques for business intelligence, analyzing a business case involving Universal Bank. It begins with a PCA (Principal Component Analysis) output, outlining steps to determine variance, identify key features, and address normalization. The advantages and disadvantages of PCA are discussed. The second part focuses on customer data analysis using an Excel spreadsheet. The data is partitioned using XLMiner, and pivot tables are designed to determine the probability of customers accepting a loan based on their credit card ownership and online banking usage. Conditional probabilities and Naive Bayes probability are calculated to assess loan acceptance rates. The analysis highlights that customers with credit cards and active online banking usage have a higher probability of accepting loans. The document provides a comprehensive overview of data analysis methods and practical application in a business context.

DATA MINING AND VISUALISATION FOR BUSINESS INTELLIGENCE
BUSINESS CASE ANALYSIS – I
STUDENT ID

Question 1
(a) PCA Output without normalisation
The PCA analysis is based on the output above. The following steps need to be performed in
this regard.
Step 1: Determine the extent of variance that needs to be accounted for. This information is
needed to identify the principal components that should be analysed further. For instance, a
limit of 97% variance would involve the first seven principal components, while a lower limit
of 80% would restrict the analysis to the first four principal components.
Step 2: For the principal components identified above, the next step is to identify the
features of the utilities which are essential for each of them. In this endeavor, only the
magnitude of each coefficient should be considered and its sign ignored. For instance, in the
case of principal component 1, the highest coefficient magnitudes are for two features, namely
x1 and x2. This implies that these features are significant in accounting for the first
principal component. Similarly, this process needs to be extended to the other principal
components as well.
Step 3: Based on the key features identified, a further summary of the critical features can
be made according to their relative importance as indicated in the principal component matrix.
X1 and X3 emerge as the most significant features on which a comparison between utility firms
may be carried out.
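Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component variances and the loadings below are hypothetical numbers chosen to mirror the thresholds and features mentioned above, not the values from the actual PCA output, and the function names are invented for this sketch.

```python
from itertools import accumulate


def components_needed(variances, threshold):
    """Return how many leading components reach the cumulative variance share."""
    total = sum(variances)
    for count, running in enumerate(accumulate(variances), start=1):
        if running / total >= threshold:
            return count
    return len(variances)


def top_features(loadings, names, k=2):
    """Rank features by the absolute value of their loading, ignoring the sign."""
    ranked = sorted(zip(names, loadings), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical component variances, sorted from largest to smallest.
variances = [4.0, 2.0, 1.5, 1.0, 0.6, 0.4, 0.3, 0.2]
# Hypothetical loadings of principal component 1 on four features.
pc1_loadings = [0.62, -0.58, 0.10, 0.05]
feature_names = ["x1", "x2", "x3", "x4"]

print(components_needed(variances, 0.80))
print(components_needed(variances, 0.97))
print(top_features(pc1_loadings, feature_names))
```

With real output, the variances and loadings would be read from the PCA result rather than typed in by hand.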
Normalisation
Sometimes, before conducting PCA, data normalisation is carried out to eliminate the impact of
the different scales used for the variables. If this is not rectified, the contribution to
total variance of a variable measured on a large scale can be significantly overrepresented,
which overstates its importance. However, for the given case, PCA with data normalisation was
also carried out and no significant difference could be noticed. It is therefore appropriate
for the given case to conduct the analysis without normalisation.
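The scale effect described above can be illustrated with a small two-variable sketch. The data are hypothetical (the names `sales` and `cost` are assumptions, not columns from the assignment data), and the largest eigenvalue of the 2x2 covariance matrix is computed with the closed-form expression for a symmetric 2x2 matrix instead of a PCA library.

```python
import math
import statistics


def first_component_share(xs, ys):
    """Share of total variance captured by the first principal component."""
    var_x = statistics.variance(xs)
    var_y = statistics.variance(ys)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    # Largest eigenvalue of [[var_x, cov_xy], [cov_xy, var_y]].
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    return (half_trace + spread) / (var_x + var_y)


def standardise(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical data: one variable on a large scale, one on a small scale.
sales = [1200.0, 3400.0, 2100.0, 5600.0, 4300.0, 2900.0]
cost = [0.61, 0.58, 0.70, 0.55, 0.66, 0.52]

raw_share = first_component_share(sales, cost)
scaled_share = first_component_share(standardise(sales), standardise(cost))
print(f"raw: {raw_share:.4f}, normalised: {scaled_share:.4f}")
```

On the raw data the large-scale variable accounts for almost all of the first component's variance. After standardisation the share depends only on the correlation between the two variables.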
(b)
The advantages of the PCA method in data mining and visualization are as follows:
1. The result is generated in the form of an orthogonal matrix and therefore comparison and
analysis are convenient.
2. PCA is considered a “max-variance” technique and hence it eliminates noise variables (those
with low variance) from the analysis.
3. It also decreases the risk of over-fitting that may arise when analysing large data sets.
4. The visualization of PCA results is quite easy because each principal component has its own
unique axis, which is exactly at a right angle to the axis of every other principal
component.
5. In the dimension reduction process, there is a possibility that the data may get
over-fitted. This risk of over-fitting can be reduced considerably by applying the PCA method.
6. This method is best suited to examining and visualizing the underlying structure of the data.
7. Visualization is done either in m-dimensional space or in p-dimensional space, where the
data can easily be viewed as a point cloud.
8. Variables with linear and logical relations can easily be analyzed by deploying the PCA
method.
The disadvantages of the PCA method in data mining and visualization are as follows:
1. For a dataset that results from a blind source with undefined mean and variance, PCA cannot
be employed for data reduction.
2. When the dataset has any variable of categorical type, the PCA method cannot be applied.
3. Computing the max-variance component in a long, complex orthogonal matrix is a difficult
and time-consuming task.
4. Further, the covariance matrix can be complex when the number of key significant patterns
is high and the difference in variance is very small.
5. Scaling can be a problem in some cases and hence data normalization is required before
running PCA on the data.
6. This method fails to analyze the data variables when they exhibit non-linear correlation.
7. Each component lies along its own axis in the space, and this axis is exactly at 90 degrees
to the axis of every other component. Hence, determining the actual direction of a principal
component is a difficult task.
8. When the size of the matrix is significantly large, the evaluation of the max-variance
component becomes a critical task.
Question 2
The details of the customers (5000 customers) of Universal Bank are provided in an Excel
spreadsheet. The analysis is focused on the training data and therefore the first step is to
partition the given data.
To partition the given data, the XLMiner Analytical Tool (an Excel add-in) is used. In the
standard partition in XLMiner, the split between the training set and the validation set is
60% and 40% respectively. This is the same partition as the one given in the question.
Hence, there is no need to change the partition percentages. The training set contains 3000
records and the validation set contains 2000 records.
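For readers without XLMiner, a 60/40 random partition can be sketched as follows. This is a generic shuffle-and-split under assumed settings (the seed and the function name are hypothetical) and is not a reproduction of XLMiner's own partitioning routine, so the individual records assigned to each set would differ.

```python
import random


def partition(record_ids, train_fraction=0.6, seed=12345):
    """Shuffle the record ids and split them into training and validation sets."""
    shuffled = list(record_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Customer ids 1..5000 stand in for the rows of the spreadsheet.
train_ids, valid_ids = partition(range(1, 5001))
print(len(train_ids), len(valid_ids))
```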
Output generated through XLMiner
The training data has been extracted by clicking the training data link in the “Output Navigator”.
(a) Design of pivot table
The pivot table has been made using the built-in “Pivot table” function of Excel. The column
label has been taken as “Online”, the first row label as “Credit Card (CC)” and the second
row label as “Personal Loan (Loan)”.
Pivot table
Notations
Label = 0 (represents NO)
Label = 1 (represents YES)
For example:
CC = 1 (YES, the customer possesses a credit card of Universal Bank)
Online = 0 (NO, the customer is not a user of the online banking service of Universal Bank)
Loan = 1 (YES, the customer will accept the loan from Universal Bank).
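The same cross-tabulation can be sketched outside Excel. The eight records below are hypothetical and serve only to show the layout (CC and Loan as row labels, Online as the column label); they are not taken from the Universal Bank data.

```python
from collections import Counter

# Hypothetical training records as (CC, Loan, Online) flags; 1 = YES, 0 = NO.
records = [
    (1, 1, 1), (1, 0, 1), (1, 0, 0), (0, 0, 1),
    (0, 1, 0), (0, 0, 0), (1, 0, 1), (0, 0, 1),
]

# Count the customers in every (CC, Loan, Online) cell, as the pivot table does.
cells = Counter(records)

print("CC Loan | Online=0 Online=1")
for cc in (0, 1):
    for loan in (0, 1):
        row = [cells[(cc, loan, online)] for online in (0, 1)]
        print(f" {cc}    {loan} | {row[0]:8d} {row[1]:8d}")
```

A `Counter` returns 0 for combinations that never occur, so empty cells print as zeros instead of raising an error.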
(b) Aim
To determine the probability that a customer who already possesses a credit card of Universal
Bank and is also a user of its online banking service will accept the loan from Universal Bank,
i.e. P(Loan=1 | CC=1, Online=1).
• The probability is determined by dividing the count of favorable outcomes by the total count
of possible outcomes.
• Count of favorable outcomes: Credit card (YES), Online (YES) and Loan (YES), i.e.
Count(Loan=1, CC=1, Online=1) = 51
• Total count of possible outcomes: Credit card (YES) and Online (YES), i.e.
Count(CC=1, Online=1) = 522
Conditional probability = Count of favorable outcomes / Total count of possible outcomes
= 51/522 = 0.0977
Hence, it can be concluded that there is a 9.77% conditional probability that a customer who
already possesses a credit card of Universal Bank and is also a user of its online banking
service will accept the loan from Universal Bank.
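The ratio of the two pivot-table counts can be checked with a short sketch. The counts 51 and 522 are the ones quoted in this part; the helper name is an assumption made for illustration.

```python
from fractions import Fraction


def conditional_probability(favourable, total):
    """Return favourable / total as an exact fraction, guarding against bad counts."""
    if total <= 0 or not 0 <= favourable <= total:
        raise ValueError("counts must satisfy 0 <= favourable <= total and total > 0")
    return Fraction(favourable, total)


# Count(Loan=1, CC=1, Online=1) and Count(CC=1, Online=1) from the pivot table.
p_loan_given_cc_online = conditional_probability(51, 522)
print(f"{float(p_loan_given_cc_online):.4f}")
```

Keeping the value as a `Fraction` postpones rounding until the result is displayed.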
(c) Designing of two pivot tables
The two pivot tables have been made using the built-in “Pivot table” function of Excel.
1. The column label has been taken as “Online” and the row label as “Personal Loan (Loan)”.
2. The column label has been taken as “Credit Card (CC)” and the row label as “Personal Loan
(Loan)”.
Computation of P(A|B) for various conditions (values truncated to three decimal places)
(i) Proportion of customers accepting the loan offer who possess a credit card, i.e. CC YES
given Loan YES
P(CC=1 | Loan=1) = Positive outcome / Total possible outcomes = 93/304 = 0.305
(ii) Proportion of customers accepting the loan offer who are active users of the online
service, i.e. Online YES given Loan YES
P(Online=1 | Loan=1) = Positive outcome / Total possible outcomes = 183/304 = 0.601
(iii) Proportion of customers who will accept the loan offer, i.e. Loan YES
P(Loan=1) = Positive outcome / Total possible outcomes = 304/3000 = 0.101
(iv) Proportion of customers not accepting the loan offer who possess a credit card, i.e. CC
YES given Loan NO
P(CC=1 | Loan=0) = Positive outcome / Total possible outcomes = 800/2696 = 0.296
(v) Proportion of customers not accepting the loan offer who are active users of the online
service, i.e. Online YES given Loan NO
P(Online=1 | Loan=0) = Positive outcome / Total possible outcomes = 1586/2696 = 0.588
(vi) Proportion of customers who will not accept the loan offer, i.e. Loan NO
P(Loan=0) = Positive outcome / Total possible outcomes = 2696/3000 = 0.898
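The six ratios can be recomputed from the counts quoted in this part with a minimal sketch. The dictionary layout and labels are assumptions made for readability; the counts themselves come from the two pivot tables.

```python
from fractions import Fraction

# Each entry is positive outcomes / total possible outcomes from part (c).
ratios = {
    "P(CC=1 | Loan=1)": Fraction(93, 304),
    "P(Online=1 | Loan=1)": Fraction(183, 304),
    "P(Loan=1)": Fraction(304, 3000),
    "P(CC=1 | Loan=0)": Fraction(800, 2696),
    "P(Online=1 | Loan=0)": Fraction(1586, 2696),
    "P(Loan=0)": Fraction(2696, 3000),
}

for label, value in ratios.items():
    print(f"{label} = {float(value):.4f}")
```

A quick consistency check is that P(Loan=1) and P(Loan=0) sum to one, since 304 + 2696 = 3000.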
(d) The Naïve Bayes probability is also a conditional probability, here P(Loan=1 | CC=1,
Online=1), and it is built from the following two factors.
Positive outcome = product of the probabilities under which the customer will accept the
offer of loan, i.e. P(CC=1 | Loan=1) × P(Online=1 | Loan=1) × P(Loan=1) = (0.305 × 0.601 × 0.101)
Conditional factor = product of the probabilities under which the customer will not accept the
offer of loan, i.e. P(CC=1 | Loan=0) × P(Online=1 | Loan=0) × P(Loan=0) = (0.296 × 0.588 × 0.898)
Total possible outcomes = Positive outcome + Conditional factor
Naïve Bayes Probability = Positive outcome / Total possible outcomes ≈ 10.61%
Hence, the Naïve Bayes probability computed from the quantities of part (c) comes out to be
10.61%.
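The calculation in part (d) can be sketched as a small function. The function and argument names are hypothetical, and the inputs here are the unrounded count ratios from part (c), so the result may differ slightly in the last digits from the figure quoted above, depending on how the intermediate probabilities are rounded.

```python
def naive_bayes_posterior(p_cc_yes, p_online_yes, p_yes, p_cc_no, p_online_no, p_no):
    """P(Loan=1 | CC=1, Online=1) under the naive independence assumption."""
    positive = p_cc_yes * p_online_yes * p_yes   # loan accepted
    negative = p_cc_no * p_online_no * p_no      # loan not accepted
    return positive / (positive + negative)


posterior = naive_bayes_posterior(
    p_cc_yes=93 / 304,        # P(CC=1 | Loan=1)
    p_online_yes=183 / 304,   # P(Online=1 | Loan=1)
    p_yes=304 / 3000,         # P(Loan=1)
    p_cc_no=800 / 2696,       # P(CC=1 | Loan=0)
    p_online_no=1586 / 2696,  # P(Online=1 | Loan=0)
    p_no=2696 / 3000,         # P(Loan=0)
)
print(f"{posterior:.2%}")
```

The denominator normalises the two products so that the probabilities of accepting and not accepting the loan sum to one.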
(e) The respective probabilities of a loan being accepted need to be seen in the light of
mainly two events, i.e. possession of a credit card and active usage of online services. It has
been noticed that customers who are active users of online banking services tend to have a
higher probability of accepting the loan. A similar observation can be made for customers who
have credit cards. Hence, it may be concluded that a customer is most likely to accept a loan
when he or she possesses a credit card and uses online banking services on an active basis.