Predictive Analytics Report: Portuguese Banking Institution Data

Added on 2022/08/15

AI Summary

This report provides a comprehensive analysis of predictive analytics within the banking sector, focusing on the application of machine learning models to predict client subscriptions to term deposits. The study utilizes a dataset related to direct marketing campaigns of a Portuguese banking institution. The analysis begins with data preprocessing and visualization, followed by the construction and comparison of five different machine learning models: Naive Bayes, Random Forest, Decision Tree, Support Vector Machine (SVM), and Logistic Regression. The report details the process, including the use of the R programming language, and addresses key questions such as dataset balance, the presence of missing values, and outlier detection. The results highlight the Naive Bayes model as the most accurate, achieving a classification accuracy of 99.02%. The report also provides a comparison of the performance of other models and discusses the importance of normalization before model building. The report concludes with references to relevant academic literature.

Running head: Predictive Analytics
Predictive Analytics
Student Name:
Student ID:
University Name:
Paper Code:

Executive Summary
Machine learning and artificial intelligence are spreading across industries that serve
large customer bases, and the banking sector is one of the industries that has adopted these
technologies. Banks use predictive analysis for budgeting, prevention and approval purposes.
This analysis uses a dataset related to the direct marketing campaigns of a Portuguese
banking institution. The classification goal is to predict whether the client will subscribe
to a term deposit (variable y). Several machine learning models have been built to see which
one classifies most accurately, and that model is then used for prediction.

Table of Contents
Executive Summary
Introduction
Process
Results
References

Introduction
Machine learning allows a machine to learn from data and make decisions with little human
intervention. Machine learning models are used for predicting future events and for
forecasting (Biau and Scornet, 2016). Many programs depend entirely on machine learning and
data mining techniques to analyze and visualize huge amounts of information.
The dataset used is a clean dataset with no missing or null values. It relates to the
direct marketing campaigns of a Portuguese banking institution. The campaigns were based on
phone calls rather than direct interaction. To assess whether the product would be
subscribed to or not, more than one contact with the same client needed to be recorded. The
classification goal is to predict whether the client will subscribe to a term deposit
(variable y).
The analysis was performed step by step: importing the necessary libraries, loading the
dataset, pre-processing the data and then visualizing it (Bischl et al., 2016). Finally,
five machine learning models were built, using different classification and regression
techniques, to predict the target variable.
Some of the questions that arise while the analysis is performed include:
1. Is the dataset balanced?
2. Is the dataset clean, or does it contain null values that could affect the accuracy of
the model?
3. Are there any outliers present in the dataset?
4. Which model produced the highest accuracy and why?
5. Is the dataset normalized before building the model?
6. How can the model that best fits the dataset be chosen?
7. How do all the models compare with one another?
Process
Different analytics and decision-making problems are solved using the R programming
language. The analysis shows that the dataset is unbalanced: the numbers of yes and no
values in the target variable are unequal.
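A balance check of this kind can be sketched as follows. This is a minimal illustration in Python rather than the report's R code, and the label counts are hypothetical, not the dataset's actual figures. It also reports the share of the majority class, which is the accuracy a model would reach by always predicting that class.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the rows, as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical target column: nine "no" answers for every "yes".
y = ["no"] * 9 + ["yes"] * 1

shares = class_balance(y)
# Accuracy of a trivial model that always predicts the most common class.
majority_share = max(shares.values())

print(shares)
print(majority_share)
```

On an unbalanced dataset, a model's accuracy is more informative when read against this majority-class baseline.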
Figure 1: Null or missing value percentage
The analysis and Figure 1 show that the dataset contains no missing or null values, as the
figure indicates 0% for every attribute in the dataset. Because there are no missing values,
no imputation is needed and missing data is not expected to reduce the accuracy of any of
the models.
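The percentages summarized in Figure 1 could be computed along these lines. This is a sketch in Python, not the report's R code. The column names and rows are hypothetical, and it assumes a missing value appears as None or an empty string.

```python
def missing_percentage(rows, columns):
    """Percentage of rows in which each column is None or an empty string."""
    if not rows:
        return {col: 0.0 for col in columns}
    result = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        result[col] = 100.0 * missing / len(rows)
    return result

# Hypothetical rows; the real dataset has many more attributes.
rows = [
    {"age": 30, "job": "admin.", "y": "no"},
    {"age": 45, "job": "technician", "y": "yes"},
    {"age": 52, "job": "services", "y": "no"},
]

report = missing_percentage(rows, ["age", "job", "y"])
print(report)
```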

The dataset contains outliers in many attributes, but these are not expected to affect the
performance of the models. The analysis includes a visualization of the dependent variable
against the independent variables, shown in Figure 2.
Figure 2: Dependent vs independent variable
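The report does not state how outliers were identified. One common convention, assumed here purely for illustration, is the 1.5 x IQR rule used by box plots. The sketch below is in Python rather than R, and the ages are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical client ages with one extreme value.
ages = [18, 25, 30, 32, 35, 38, 41, 45, 95]

outliers = iqr_outliers(ages)
print(outliers)
```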
The decision tree classifier is one of the best tree-based models and uses a
divide-and-conquer method to learn patterns (Ghatak, 2017). Models of this type use a
structure of branches and leaves, where each leaf gives the outcome of a decision, whether
yes or no (Nuti, Rugama and Cross, 2019).
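The divide-and-conquer step can be illustrated by scoring a candidate split. The sketch below assumes Gini impurity as the splitting criterion, which the report does not specify. It is written in Python rather than R and uses hypothetical labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two branches produced by a split."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

# Hypothetical node with six clients, split into two branches.
parent = ["yes", "yes", "no", "no", "no", "no"]
left = ["yes", "yes"]
right = ["no", "no", "no", "no"]

print(gini(parent))
print(split_impurity(left, right))
```

A tree grows by repeatedly choosing the split that lowers this impurity the most, then applying the same procedure to each branch.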
SVM works with hyperplanes, as the main purpose of SVM is to draw a boundary that separates
the two classes from one another (Rakotomamonjy, 2003). If more than one separating
hyperplane is found, the maximum-margin hyperplane is selected, as it creates the largest
separation between the two classes (Tanha, Someren and Afsarmanesh, 2017).
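For a linear SVM written as w . x + b = 0 and scaled so that the closest points satisfy |w . x + b| = 1, the margin width is 2 / ||w||. The following Python sketch evaluates that formula and the resulting class assignment. The weights, bias and points are hypothetical values chosen for easy arithmetic, not fitted to the dataset.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two margin boundaries of a canonical hyperplane."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
w, b = (3.0, 4.0), -5.0

width = margin_width(w)
side_a = 1 if decision_value(w, b, (3.0, 1.0)) >= 0 else -1
side_b = 1 if decision_value(w, b, (0.0, 0.0)) >= 0 else -1
print(width, side_a, side_b)
```

Maximizing the margin is therefore equivalent to finding the separating hyperplane with the smallest ||w||.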
The Naive Bayes algorithm assumes that all the features in the dataset are equally important
and independent of one another, hence the name naive (Rish, 2001).
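The independence assumption means that the score of a class is its prior probability multiplied by one conditional probability per feature. The sketch below shows this for categorical features. It is written in Python rather than R, and the features and rows are hypothetical. Laplace smoothing is added as an assumption so that unseen values do not produce a zero probability.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[label][feature_index][value] -> number of rows
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row, n_values):
    """Return (best_label, scores); n_values[i] is the number of distinct values of feature i."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # P(value | class) with Laplace (add-one) smoothing
            score *= (value_counts[label][i][value] + 1) / (count + n_values[i])
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical features: (has housing loan, contact type).
rows = [("yes", "cellular"), ("no", "cellular"), ("yes", "telephone"),
        ("yes", "telephone"), ("no", "telephone"), ("yes", "cellular")]
labels = ["yes", "yes", "no", "no", "no", "no"]

model = train_naive_bayes(rows, labels)
prediction, scores = predict_naive_bayes(model, ("no", "cellular"), n_values=[2, 2])
print(prediction, scores)
```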
The random forest classifier is an algorithm that uses a large number of individual decision
trees acting as a group to make a prediction (Tanha, Someren and Afsarmanesh, 2017).
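The group decision can be sketched as a majority vote over the trees. In this Python illustration each tree is reduced to a single hypothetical rule, and the field names and thresholds are invented for the example. A real forest would learn full trees from random samples of the data.

```python
from collections import Counter

def forest_predict(trees, client):
    """Ask every tree for a label and return the most common answer."""
    votes = [tree(client) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-rule "trees"; field names and thresholds are illustrative only.
trees = [
    lambda c: "yes" if c["duration"] > 300 else "no",
    lambda c: "yes" if c["previous_contacts"] > 2 else "no",
    lambda c: "yes" if c["age"] > 60 else "no",
]

client = {"duration": 420, "previous_contacts": 3, "age": 35}
label = forest_predict(trees, client)
print(label)
```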
When the dependent variable is binary, logistic regression is the most suitable regression
model. It is used mainly in predictive analysis (Wainberg, Alipanahi and Frey, 2016).
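Logistic regression handles a binary target by passing a weighted sum of the features through the sigmoid function, which yields a probability between 0 and 1. The Python sketch below uses hypothetical weights and a 0.5 decision threshold, neither of which comes from the report.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1) without overflowing."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(weights, bias, features):
    """Probability of the positive class for one client."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical coefficients and a client described by two scaled features.
weights, bias = [0.8, -0.5], -1.0
client = [2.0, 1.0]

probability = predict_probability(weights, bias, client)
label = "yes" if probability >= 0.5 else "no"
print(probability, label)
```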
The analysis found that the Naive Bayes model produced the highest classification accuracy,
99.02%. Naive Bayes is a family of classification algorithms based on Bayes' theorem.
Normalization is also essential before building the model, as its main purpose is to convert
the numerical values of the dataset to a common scale without distorting the differences in
the ranges of the values.
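The report does not say which normalization method was applied. Min-max scaling is assumed here as one common choice. The Python sketch shows two hypothetical attributes with very different ranges landing on the same 0 to 1 scale while keeping their relative spacing.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attributes measured on very different scales.
ages = [20, 35, 50, 80]
balances = [0, 1500, 3000, 6000]

scaled_ages = min_max_scale(ages)
scaled_balances = min_max_scale(balances)
print(scaled_ages)
print(scaled_balances)
```

After scaling, neither attribute dominates distance-based or margin-based models such as SVM simply because its raw numbers are larger.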
The best model can be chosen from the confusion matrix, by observing the number of correctly
classified instances, and from the accuracy reported in the analysis. All the results of the
analysis are given in the Results section.
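The link between the confusion matrix and accuracy can be sketched as follows. This is a Python illustration with hypothetical actual and predicted labels, not the report's R output.

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary target."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            cm["tp" if a == positive else "fp"] += 1
        else:
            cm["fn" if a == positive else "tn"] += 1
    return cm

def accuracy(cm):
    """Share of all predictions that were correct."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())

# Hypothetical labels for ten clients.
actual = ["no", "no", "no", "yes", "yes", "no", "no", "yes", "no", "no"]
predicted = ["no", "no", "yes", "yes", "no", "no", "no", "yes", "no", "no"]

cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)
print(cm)
print(acc)
```

The diagonal cells (tp and tn) are the correctly classified instances. The off-diagonal cells show which kind of error a model makes, which accuracy alone does not reveal.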

Results
The analysis shows that Naive Bayes is the best model, with 99.02% accuracy, followed by the
random forest classifier with 90.87%, the decision tree with 90.33%, SVM with 89.23% and
finally logistic regression with 88.56%.
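The comparison can be reproduced from the reported figures alone. The short Python sketch below ranks the five accuracies stated above and computes the gap between the best model and the runner-up. It uses no data beyond the numbers in this section.

```python
# Accuracies (%) as reported in this section.
reported_accuracy = {
    "Logistic Regression": 88.56,
    "SVM": 89.23,
    "Decision Tree": 90.33,
    "Random Forest": 90.87,
    "Naive Bayes": 99.02,
}

# Sort from most to least accurate.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)

# Difference in percentage points between the top two models.
gap = ranking[0][1] - ranking[1][1]

for name, value in ranking:
    print(f"{name}: {value:.2f}%")
print(f"Gap between first and second: {gap:.2f} points")
```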

References
Biau, G. and Scornet, E., 2016. A random forest guided tour. Test, 25(2), pp.197-227.
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G. and
Jones, Z.M., 2016. mlr: Machine Learning in R. The Journal of Machine Learning Research,
17(1), pp.5938-5942.
Ghatak, A., 2017. Machine learning with R. Singapore: Springer Singapore.
Nuti, G., Rugama, L.A.J. and Cross, A.I., 2019. A Bayesian Decision Tree Algorithm. arXiv
preprint arXiv:1901.03214.
Rakotomamonjy, A., 2003. Variable selection using SVM-based criteria. Journal of Machine
Learning Research, 3(Mar), pp.1357-1370.
Rish, I., 2001, August. An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop
on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).
Tanha, J., van Someren, M. and Afsarmanesh, H., 2017. Semi-supervised self-training for
decision tree classifiers. International Journal of Machine Learning and Cybernetics, 8(1),
pp.355-370.
Wainberg, M., Alipanahi, B. and Frey, B.J., 2016. Are random forests truly the best classifiers?
The Journal of Machine Learning Research, 17(1), pp.3837-3841.