CASE STUDY REPORT on
DATA ANALYTICS:
BANK MARKETING
Contents:
1. INTRODUCTION
2. DATA
3. PROBLEM STATEMENT / OBJECTIVE
4. METHODS USED
5. ALGORITHM / MODELS
6. FEATURE ANALYSIS & SUGGESTIONS
7. CONCLUSION
1. Introduction
Banks exist to provide financial services and to generate revenue from them. With this in mind, banks run different marketing campaigns to sell their products and services. In this case study, we examine a Portuguese bank's direct marketing campaign to sell term deposits.
2. Data
The data set is the Portuguese Bank Marketing Data Set from the University of California, Irvine (UCI) Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
The objective of the campaign was to persuade customers to subscribe to a term deposit, and the campaign was conducted through phone calls. Because this is an open data set, the insights drawn from it can help improve the bank's future marketing campaigns. The attributes of the data set are described below:
Table 1: Feature descriptions of the dataset
Feature: Description
Age: Age of the customer
Job: Type of job
Marital: Marital status
Education: Education level
Default: Has credit in default?
Balance: Account balance of the customer
Housing: Has a housing loan?
Loan: Has a personal loan?
Contact: Contact communication type
Day: Last contact day of the month
Month: Last contact month of the year
Duration: Last contact duration, in seconds
Campaign: Number of contacts performed during this campaign
Pdays: Number of days that passed since the client was last contacted in a previous campaign
Previous: Number of contacts performed before this campaign
Poutcome: Outcome of the previous marketing campaign
Deposit: Has the client subscribed to a term deposit?
The data set contains 11,162 observations and 17 variables.
3. Problem Statement / Objective
The objective of this assignment is to predict whether a client will purchase a term deposit. A further objective is to identify which attributes are most useful for predicting whether a client will purchase a term deposit.
4. Methods Used
❏ ONE-HOT ENCODING
One-hot encoding is a method by which categorical attributes are converted into numeric form so that machine learning algorithms can make better use of them in prediction. It maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features. For string-type input data, it is common to encode the categorical features using StringIndexer first.
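As a minimal sketch (not necessarily the exact code used in this report), assuming Spark 3.x and a DataFrame df already loaded from the bank CSV, encoding a single categorical column such as job could look like this; the other categorical columns are handled the same way:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map the string values of "job" to label indices, then expand each index
# into a binary vector with at most a single one-value.
indexer = StringIndexer(inputCol="job", outputCol="job_index")
indexed = indexer.fit(df).transform(df)

encoder = OneHotEncoder(inputCols=["job_index"], outputCols=["job_vec"])
encoded = encoder.fit(indexed).transform(indexed)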
❏ MIN-MAX NORMALIZATION
Min-max scaling, or min-max normalization, is the simplest method: it rescales the range of the features to [0, 1] or [−1, 1], with the target range chosen according to the data pattern. For the [0, 1] case the general formula is x' = (x − min(x)) / (max(x) − min(x)), where x is an original value and x' is the normalized value.
MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel; the model can then transform each feature individually so that it falls within the given range.
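A minimal sketch of this step, continuing from the encoding sketch above and assuming the numeric column names from Table 1:

from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Assemble the numeric attributes into a single vector column.
assembler = VectorAssembler(
    inputCols=["age", "balance", "duration", "campaign", "pdays", "previous"],
    outputCol="num_features")
assembled = assembler.transform(encoded)

# Compute per-feature min/max statistics and rescale every feature into [0, 1].
scaler = MinMaxScaler(inputCol="num_features", outputCol="scaled_features")
scaled = scaler.fit(assembled).transform(assembled)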
❏ PCA
PCA stands for principal component analysis, which summarizes the joint behavior of the attributes. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. I selected k = 2 for PCA, and the steps I follow to build the model are straightforward.
I create a model with the PCA function, fit it on the training data, and transform the test data with the fitted model. This provides good dimensionality reduction.
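A minimal sketch of this step, assuming train and test splits whose features have already been assembled and scaled into the scaled_features column:

from pyspark.ml.feature import PCA

# Project the scaled feature vectors onto the first two principal components (k = 2).
pca = PCA(k=2, inputCol="scaled_features", outputCol="pca_features")
pca_model = pca.fit(train)
train_pca = pca_model.transform(train)
test_pca = pca_model.transform(test)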
5. Algorithms / Models
❏ Clustering / Unsupervised learning
K-Means Clustering
We performed the elbow method to select the number of clusters and chose k = 2 for the analysis. The steps to build the model are straightforward: I create a model with the KMeans function, fit it on the training data, and predict the cluster assignments on the test data.
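A minimal sketch of the clustering step under the same assumptions as above; here the elbow method is approximated by printing the training cost for several values of k (the model summary exposes this cost in Spark 2.4 and later):

from pyspark.ml.clustering import KMeans

# Elbow method: watch how the within-cluster cost falls as k grows.
for k in range(2, 9):
    model = KMeans(k=k, featuresCol="scaled_features", seed=1).fit(train)
    print(k, model.summary.trainingCost)

# Final model with the chosen k = 2; transform() adds a "prediction" column
# holding the assigned cluster id.
kmeans_model = KMeans(k=2, featuresCol="scaled_features", seed=1).fit(train)
clusters = kmeans_model.transform(test)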
❏ Classification / Supervised learning:
Logistic Regression / Generalized Linear Model
Logistic regression is used when the target variable is categorical; here the target is the variable deposit in our dataset df2. Before the actual model building, we split the data into a training set and a test set in a 70% / 30% ratio. The accuracy we obtained is only about 43.32%, which is not good; some attributes appear to be shadowing the effect of the main attributes.
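A minimal sketch of the split and the model, assuming df2 already contains an assembled features vector column and a numeric label column derived from deposit (these column names are assumptions, not taken from the report):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 70% / 30% train-test split.
train, test = df2.randomSplit([0.7, 0.3], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label")
predictions = lr.fit(train).transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Accuracy:", evaluator.evaluate(predictions) * 100, "%")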
Decision Tree Model
Decision trees are a popular family of classification and regression methods. Details of the spark.ml implementation can be found in the code; here we discuss only the performance of the model. The accuracy we obtained for this particular model is 100%.
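A minimal sketch, reusing the train/test split and the evaluator from the logistic regression sketch above:

from pyspark.ml.classification import DecisionTreeClassifier

# Fit a single decision tree on the same features and label columns.
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
dt_predictions = dt.fit(train).transform(test)
print("Accuracy:", evaluator.evaluate(dt_predictions) * 100, "%")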
Naive Bayes Algorithm
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes; more information can be found in the Naive Bayes section of MLlib. After building the best model with this algorithm, the overall accuracy we obtained is about 92.34%, which is a remarkable result.
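A minimal sketch, again reusing the split and evaluator from above; multinomial naive Bayes requires non-negative feature values, which the earlier min-max scaling to [0, 1] guarantees:

from pyspark.ml.classification import NaiveBayes

# Multinomial Naive Bayes on the same features and label columns.
nb = NaiveBayes(modelType="multinomial", featuresCol="features", labelCol="label")
nb_predictions = nb.fit(train).transform(test)
print("Overall accuracy:", evaluator.evaluate(nb_predictions) * 100, "%")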
6. Feature Analysis & Suggestions
Only a few features actually affect the performance of the models. Before any model building, we need to apply feature selection methods, such as calculating the correlation of all
variables so that we can judge their significance, the standard scaler method, forward selection, and so on; a minimal correlation sketch is shown below. Hypothesis testing would also help in building and selecting the models. At this level, the best-performing models are K-Means clustering and the Naive Bayes algorithm.
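As a minimal sketch of the correlation check suggested above, assuming the assembled numeric vector column num_features from the scaling step:

from pyspark.ml.stat import Correlation

# Pearson correlation matrix of the numeric features; pairs with very high
# correlation (or near-constant variables) are candidates for removal.
corr_matrix = Correlation.corr(assembled, "num_features").head()[0]
print(corr_matrix)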
7. Conclusion
The motive of this study was to determine which variables are highly significant for predicting whether a client will purchase a term deposit. A second aim was to determine which variables lead to the most term deposit purchases. The variables that offer the highest chance of success are important to a bank: a bank can use them to target the clients that are most likely to purchase a term deposit.