Data Mining and Business Intelligence Analysis
Added on 2020/03/23
AI Summary
This assignment applies data mining techniques to business analysis. Principal Component Analysis (PCA) is used to understand patterns in a provided dataset of US utilities. Pivot tables and Naive Bayes are then used to calculate probabilities of personal loan acceptance based on customers' credit card ownership and online banking activity. The aim is to analyse customer behaviour and identify trends that can inform business strategies.
Data Mining and Visualization for Business Intelligence
Assignment
Student id/name
[Pick the date]
Question 1
a) The given data, which captures features of 32 US-based utilities, has been analysed using PCA
and the results obtained are shared below.
The variance matrix indicates that the four most significant principal components account for
80% of the cumulative variance. Limiting the objective to explaining this much variance, the
following conclusions can be drawn from the principal component matrix about the identified
principal components (Shmueli et al., 2016).
The above has been derived by considering, for each principal component, the two features with
the highest absolute loadings. The sign is ignored because it only captures direction.
Because the variables in the dataset are measured on different scales, a critical decision is
whether the data should be normalised before PCA. For the given dataset, normalisation does not
appear to be required. The variance matrix shows that the first principal component represents
about one quarter of the cumulative variance, and this share does not change significantly when
normalisation is done before running PCA. Hence, there is no need for data normalisation here
(Hofmann & Chisholm, 2016).
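The normalisation check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the utilities dataset; the helper name `explained_variance_ratio` and the column scales are assumptions made for the example. It compares the share of variance captured by each component when PCA is run on raw data (covariance matrix) and on normalised data (correlation matrix).

```python
import numpy as np

def explained_variance_ratio(data, normalise):
    """Return the share of total variance captured by each principal component."""
    centred = data - data.mean(axis=0)
    if normalise:
        # Dividing by the standard deviation makes PCA work on the correlation matrix.
        centred = centred / data.std(axis=0, ddof=1)
    covariance = np.cov(centred, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]  # largest first
    return eigenvalues / eigenvalues.sum()

# Synthetic stand-in data (assumption): 32 rows, 4 variables on very different scales.
rng = np.random.default_rng(0)
data = rng.normal(size=(32, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

raw = explained_variance_ratio(data, normalise=False)
scaled = explained_variance_ratio(data, normalise=True)
print("raw:   ", np.round(raw, 3))
print("scaled:", np.round(scaled, 3))
```

On synthetic data like this, the raw run is dominated by the largest-scale variable while the normalised run spreads the variance more evenly. If the two profiles are close for a real dataset, as the analysis above reports, normalisation changes little.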
b) Advantages of using Principal Component Analysis (PCA) are shown below (Kudyba &
Hoptroff, 2012).
It provides an easy-to-understand result, typically in the form of an orthogonal matrix derived
from the covariance matrix.
It is one of the most common and useful methods for commenting on the actual structure of the data.
It is a dimension-reduction technique that condenses complex data into a smaller number of
variables.
The newly derived variables, known as principal components, are uncorrelated with one another
(correlation coefficient = 0).
Disadvantages of using Principal Component Analysis (PCA) are shown below (Hofmann &
Chisholm, 2016).
The technique fails to capture structure among variables that are not related to each other
through a linear relationship.
The procedure cannot be applied to variables of categorical type.
At times, the PCA output is hard to read because each principal component is a linear
combination (dot product) of all the original variables.
Determining the magnitude of a principal component is easy; however, interpreting its direction
is harder because each variable contributes in its own direction, which sometimes leads to
complexity in interpretation.
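Two of the points above, that principal components are mutually uncorrelated and that their direction (sign) is arbitrary, can be checked numerically. The sketch below uses synthetic correlated data, which is an assumption for illustration only, and computes PCA through an eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data (assumption): 200 observations of 3 variables.
base = rng.normal(size=(200, 1))
data = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

centred = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))
scores = centred @ eigenvectors  # principal component scores

# Correlations between different component scores are zero up to rounding error.
score_corr = np.corrcoef(scores, rowvar=False)
print(np.round(score_corr, 6))

# Flipping the sign of a loading vector leaves the explained variance unchanged.
flipped = centred @ (-eigenvectors[:, -1])
print(np.isclose(flipped.var(ddof=1), eigenvalues[-1]))
```

Because the sign of each loading vector is arbitrary, two tools can report the same component with opposite signs, which is one reason interpretation usually relies on absolute loadings.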
Question 2
(a) Universal Bank
Total customer records = 5000
Partition:
Training = 60%
Validation = 40%
The training set would be derived by standard partition with the help of the XLMiner add-in for
Excel.
This training set of 3000 customers would then be used to determine the various quantities and
to build the pivot tables.
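The 60/40 partition can be sketched outside XLMiner as a simple random shuffle and split. The function name `partition`, the use of record ids 1 to 5000, and the seed value are assumptions for illustration; this does not reproduce the exact rows XLMiner would select.

```python
import random

def partition(record_ids, train_fraction=0.6, seed=12345):
    """Shuffle the record ids and split them into training and validation sets."""
    shuffled = list(record_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical ids standing in for the 5000 customer records.
train_ids, valid_ids = partition(range(1, 5001))
print(len(train_ids), len(valid_ids))  # 3000 2000
```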
Predictors
Online: Whether the customer is an active user of the online services offered by Universal Bank.
Credit Card: Whether the customer holds a credit card issued by Universal Bank.
Variable of interest
Personal Loan: Whether the customer accepts the bank's offer of a personal loan, considered in
light of the two predictors.
Pivot table
The table below uses Online as the column variable, with Credit Card as the main row variable
and Personal Loan as the secondary row variable.
The value 0 denotes ‘NO’, meaning that the customer does not have the respective attribute. The
value 1 denotes ‘YES’, meaning that the customer has the respective attribute.
Example:
CC = 1 means that the customer has a credit card, while CC = 0 means that the customer does not
have a credit card.
Online = 1 means that the customer uses the online service offered by the bank, while
Online = 0 means that the customer does not use it.
Loan = 1 means that the customer takes a loan from the bank, while Loan = 0 means that the
customer does not take a loan.
This coding is used to determine the probabilities and proportions in the parts below.
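To illustrate how a pivot table of this shape is built from 0/1 coded records, here is a minimal sketch. The six records are hypothetical and are not taken from the bank data; each cell is simply the count of records sharing one (CC, Loan, Online) combination.

```python
from collections import Counter

# Hypothetical records as (cc, online, loan) tuples; not taken from the bank data.
records = [(1, 1, 0), (1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]

# Each pivot-table cell counts the records with one (cc, loan, online) combination.
cells = Counter((cc, loan, online) for cc, online, loan in records)

# Rows: CC then Loan; columns: Online = 0 and Online = 1.
print("CC Loan | Online=0 Online=1")
for cc in (0, 1):
    for loan in (0, 1):
        print(f"{cc:>2} {loan:>4} | {cells[(cc, loan, 0)]:>8} {cells[(cc, loan, 1)]:>8}")
```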
(b) Probability (customer takes loan | owns credit card, uses online service) = ?
Customers having a credit card and using the online service = 522
Customers having a credit card, using the online service and taking a loan = 51
Probability = 51 / 522 = 0.0977
Hence, there is a 9.77% probability that a customer who has a credit card and uses the online
service will take a loan.
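The calculation in part (b) is a single ratio of two pivot-table counts. A short check, using only the two counts quoted above (the variable names are chosen for illustration):

```python
# Counts reported from the training-set pivot table.
cc_and_online = 522        # customers with CC = 1 and Online = 1
cc_online_and_loan = 51    # of those, customers with Loan = 1

p_loan_given_cc_online = cc_online_and_loan / cc_and_online
print(f"{p_loan_given_cc_online:.4f}")  # 0.0977
```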
(c) Pivot tables
Pivot table 1: Column label: Online; Row label: Loan
Pivot table 2: Column label: Credit card; Row label: Loan
Probabilities
(i) Proportion of customers taking the loan who hold a credit card
P(CC = 1 | Loan = 1) = 93 / 304 = 0.305
(ii) Proportion of customers taking the loan who use the online service
P(Online = 1 | Loan = 1) = 183 / 304 = 0.601
(iii) Proportion of customers who accept the loan
P(Loan = 1) = 304 / 3000 = 0.101
(iv) Proportion of customers not taking the loan who hold a credit card
P(CC = 1 | Loan = 0) = 800 / 2696 = 0.296
(v) Proportion of customers not taking the loan who use the online service
P(Online = 1 | Loan = 0) = 1586 / 2696 = 0.588
(vi) Proportion of customers who do not take the loan
P(Loan = 0) = 2696 / 3000 = 0.898
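The six quantities in part (c) can be recomputed directly from the pivot-table counts quoted above. This is a minimal sketch; the variable names are assumptions. It prints four decimal places, and the figures in the text correspond to these values cut to three decimal places.

```python
# Counts read from the training-set pivot tables (3000 customers).
total = 3000
loan_yes, loan_no = 304, 2696
cc_loan_yes, online_loan_yes = 93, 183    # among customers with Loan = 1
cc_loan_no, online_loan_no = 800, 1586    # among customers with Loan = 0

probabilities = {
    "P(CC=1 | Loan=1)": cc_loan_yes / loan_yes,
    "P(Online=1 | Loan=1)": online_loan_yes / loan_yes,
    "P(Loan=1)": loan_yes / total,
    "P(CC=1 | Loan=0)": cc_loan_no / loan_no,
    "P(Online=1 | Loan=0)": online_loan_no / loan_no,
    "P(Loan=0)": loan_no / total,
}

for label, value in probabilities.items():
    print(f"{label:<22} {value:.4f}")
```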
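(d) Naive Bayes probability
Numerator = product of all the probabilities conditional on loan acceptance, i.e.
P(CC = 1 | Loan = 1) × P(Online = 1 | Loan = 1) × P(Loan = 1) = 0.305 × 0.601 × 0.101 ≈ 0.0185
Denominator = product of the loan acceptance probabilities + product of the loan rejection
probabilities
= 0.0185 + P(CC = 1 | Loan = 0) × P(Online = 1 | Loan = 0) × P(Loan = 0) = 0.0185 + 0.296 × 0.588 × 0.898 ≈ 0.174
Naive Bayes probability = 0.0185 / 0.174 = 0.1063 = 10.63%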
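The Naive Bayes calculation can be reproduced from the raw counts as a check on the arithmetic. This is a sketch under the assumption that the counts quoted in part (c) are the full inputs; the function and key names are hypothetical. Working from unrounded fractions gives a ratio of about 0.1063.

```python
def naive_bayes_loan_probability(counts):
    """Naive Bayes estimate of P(Loan=1 | CC=1, Online=1) from pivot-table counts."""
    total = counts["loan_yes"] + counts["loan_no"]
    # P(CC=1 | Loan=1) * P(Online=1 | Loan=1) * P(Loan=1)
    accept = (counts["cc_loan_yes"] / counts["loan_yes"]
              * counts["online_loan_yes"] / counts["loan_yes"]
              * counts["loan_yes"] / total)
    # P(CC=1 | Loan=0) * P(Online=1 | Loan=0) * P(Loan=0)
    reject = (counts["cc_loan_no"] / counts["loan_no"]
              * counts["online_loan_no"] / counts["loan_no"]
              * counts["loan_no"] / total)
    return accept / (accept + reject)

counts = {
    "loan_yes": 304, "loan_no": 2696,
    "cc_loan_yes": 93, "online_loan_yes": 183,
    "cc_loan_no": 800, "online_loan_no": 1586,
}
nb_probability = naive_bayes_loan_probability(counts)
print(f"{nb_probability:.4f}")
```

The estimate differs from the direct ratio in part (b) because Naive Bayes treats credit card ownership and online usage as independent given the loan outcome.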
(e) The key to identifying the best strategy is to compare the probability of a loan being
extended to customers with and without a credit card. It is apparent that customers holding
a credit card tend to show a higher incidence of loans being extended. The same comparison
is repeated for online service usage, and again the incidence of loans is higher among users
of online services. Thus, from a customer point of view, the best chance lies in holding a
credit card and using online bank services frequently.
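The comparison described in part (e) can be sketched from the counts in part (c). The counts for customers without a credit card or without online usage are derived here by subtraction from the totals, which assumes the quoted figures are mutually consistent; the function name is hypothetical.

```python
# Counts quoted in part (c) for the 3000 training customers.
total, loan_yes = 3000, 304
cc_loan_yes, cc_loan_no = 93, 800
online_loan_yes, online_loan_no = 183, 1586

def acceptance_rates(with_loan_yes, with_loan_no):
    """Return loan rates for customers with and without a given attribute."""
    with_total = with_loan_yes + with_loan_no
    without_total = total - with_total            # derived by subtraction
    without_loan_yes = loan_yes - with_loan_yes   # derived by subtraction
    return with_loan_yes / with_total, without_loan_yes / without_total

cc_rate, no_cc_rate = acceptance_rates(cc_loan_yes, cc_loan_no)
online_rate, offline_rate = acceptance_rates(online_loan_yes, online_loan_no)
print(f"Credit card: {cc_rate:.4f} vs {no_cc_rate:.4f}")
print(f"Online:      {online_rate:.4f} vs {offline_rate:.4f}")
```

With these counts the rate is higher for credit card holders and for online users, although the gap is under one percentage point in each case.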
References
Hofmann, M. & Chisholm, A. (2016) Text Mining and Visualization: Case Studies Using Open-Source
Tools (4th ed.). London: CRC Press.
Kudyba, S. & Hoptroff, R. (2012) Data Mining and Business Intelligence: A Guide to
Productivity (3rd ed.). London: Idea Group Inc.
Shmueli, G., Bruce, P.C., Stephens, M.L. & Patel, N.R. (2016) Data Mining for Business
Analytics: Concepts, Techniques, and Applications with JMP Pro. Sydney: John Wiley &
Sons.