University Project: Machine Learning for Thera Bank Loan Prediction

Added on 2022/08/20

AI Summary

This project focuses on applying machine learning techniques to predict customer behavior in the context of a personal loan campaign for Thera Bank. The analysis begins with an executive summary outlining the project's objectives, which include identifying potential customers most likely to accept a loan offer. The project utilizes supervised learning algorithms, specifically logistic regression, decision tree classifiers, random forest classifiers, Gaussian Naive Bayes, and AdaBoostClassifier, to build predictive models. The dataset includes customer demographics and their relationship with the bank. The analysis involves data preprocessing, including outlier removal, feature importance analysis, and model evaluation using metrics such as accuracy, AUC value, and ROC curves. The findings indicate that random forest and decision tree models achieved the highest accuracy, with the importance of data cleaning emphasized. The project concludes with recommendations for future data generation and model implementation to improve prediction accuracy. The project aims to assist Thera Bank in converting liability customers to personal loan customers through targeted marketing efforts.

Running head: MACHINE LEARNING
Machine Learning
Student Name:
Student ID:
University Name:

Executive Summary
Machine learning is a field of study in which computers perform tasks without being explicitly
programmed, and in many respects it can mimic human reasoning. Its objective is to predict
future instances, which benefits the many industries that depend on future outcomes. Banking is
one such industry, where this kind of analysis and prediction plays a crucial role. The purpose
of this analysis is to identify the potential customers who have the highest probability of
purchasing a loan. Several machine learning models have therefore been implemented to determine
which one classifies the classes most accurately, and the AUC value and ROC curve of the models
are also reported. The report closes with a conclusion on the analysis and its outcomes.

Table of Contents
Executive Summary
Introduction
Discussion
Conclusion
References

Introduction
Machine learning is a method of data analysis used to build analytical models. It is also a
vital branch of AI, based on systems in which the computer has the ability to learn from the
data itself (Dey, 2016). Machine learning algorithms are mainly classified into two types:
supervised and unsupervised learning algorithms. In this analysis supervised learning has been
implemented, where the data are labelled and the inputs and outputs are known in advance
(Eweoya et al., 2019).
Banks and other businesses in the financial industry use data analysis and machine learning
technologies for two main purposes: to identify important insights in data and to prevent
fraud.
Discussion
The analysis is based on a specific bank, Thera Bank, which has a growing customer base. The
majority of the bank's customers are liability customers with deposits of varying size
(Humber, 2018). Very few of its customers are borrowers, so the management wants to expand this
base quickly in order to bring in more loan business and earn more from the interest on loans
(Gasso, 2019). Last year the bank ran a campaign for its liability customers that resulted in a
healthy conversion rate of over 9% (Kumar, 2016).
The following machine learning models have been implemented and compared on accuracy, AUC
value and ROC curve, which are the findings needed to arrive at a solution:

• Logistic Regression
• Decision Tree Classifier
• Random Forest Classifier
• Gaussian Naive Bayes
• AdaBoostClassifier
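As a minimal sketch of how such a comparison could be set up, the snippet below fits the five listed classifiers with scikit-learn and records the accuracy, AUC value and ROC curve points of each. The data are a synthetic stand-in generated with make_classification (about 9% positives, mirroring the conversion rate quoted above); the split ratio, random seeds and hyperparameters are assumptions, not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Thera Bank dataset): roughly 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)
# Stratify so that the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Default-like settings; all hyperparameters here are assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Probability of the positive class drives the ROC curve and the AUC.
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
        "roc": (fpr, tpr),
    }

for name, scores in results.items():
    print(f"{name:22s} accuracy={scores['accuracy']:.3f}  AUC={scores['auc']:.3f}")
```

The fpr and tpr arrays stored under "roc" can be passed to any plotting library to draw the ROC curve of each model.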
Detailed information about each attribute of the dataset is given below.
The attributes can be divided as follows:
• The variable ID does not add any useful information. There is no association
between a person's customer ID and the loan, and it supports no general
conclusion about future potential loan customers. We can neglect this information in our
model prediction.
There are five binary variables:
• Personal Loan – Did the customer accept the personal loan offered in the last
campaign? This is the target variable.
• Securities Account – Does the customer have a securities account with the bank?
• CD Account – Does the customer have a certificate of deposit (CD) account with the
bank?
• Online – Does the customer use internet banking facilities?
• Credit Card – Does the customer use a credit card issued by
Universal Bank?
The interval variables are:
• Age – Customer age
• Experience – Years of experience
• Income – Annual income in dollars
• CCAvg – Average credit card spending
• Mortgage – Value of house mortgage
The ordinal categorical variables are:
• Family – Family size of the customer
• Education – Customer's education level
The nominal variables are:
• ID
• Zip Code
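The grouping above can be sketched in pandas as follows. The three rows are made-up values for illustration, and the exact column names (for example "ZIP Code" and "CreditCard") are assumptions about how the file is labelled. Only ID is dropped, as the text suggests, and Personal Loan is separated as the target.

```python
import pandas as pd

# Three made-up rows; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Experience": [1, 19, 15],
    "Income": [49, 34, 112],
    "ZIP Code": [91107, 90089, 94720],
    "Family": [4, 3, 1],
    "CCAvg": [1.6, 1.5, 2.7],
    "Education": [1, 1, 2],
    "Mortgage": [0, 0, 104],
    "Personal Loan": [0, 0, 1],
    "Securities Account": [1, 1, 0],
    "CD Account": [0, 0, 0],
    "Online": [0, 0, 1],
    "CreditCard": [0, 0, 1],
})

# Attribute groups as described in the text.
binary_cols = ["Securities Account", "CD Account", "Online", "CreditCard"]
interval_cols = ["Age", "Experience", "Income", "CCAvg", "Mortgage"]
ordinal_cols = ["Family", "Education"]
nominal_cols = ["ZIP Code"]
target_col = "Personal Loan"

# ID carries no predictive information, so it is dropped before modelling.
features = df.drop(columns=["ID", target_col])
target = df[target_col]

print(features.columns.tolist())
print(target.tolist())
```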
The analysis shows that outliers are present in most of the attributes. Because outliers can
have a large impact on the accuracy of the model, all outliers have been removed from the
Income attribute.
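The report does not state which rule was used to identify the outliers. One common choice, shown here purely as an assumption, is the interquartile range (IQR) rule, which keeps values within 1.5 × IQR of the first and third quartiles. The helper name remove_iqr_outliers and the income values are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Keep only rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]


# Made-up incomes; 224 sits far above the rest of the values.
incomes = pd.DataFrame({"Income": [22, 35, 41, 48, 52, 60, 64, 75, 81, 224]})
cleaned = remove_iqr_outliers(incomes, "Income")

print(len(incomes), "->", len(cleaned))  # 10 -> 9
```

Here the quartiles are 42.75 and 72.25, so the upper fence is 116.5 and only the value 224 is removed.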
Figure 1: Feature importance graph
Figure 1 shows the importance of each attribute in the dataset, which motivates removing the
outliers from the Income attribute. The analysis also shows that only very few customers took a
personal loan from the bank (Müller & Guido, 2016). The target attribute is personal_loan, which
is predicted using the logistic regression model at the end of the analysis.
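How the feature importance graph was produced is not stated. One way to obtain such scores, assumed here, is the feature_importances_ attribute of a fitted random forest. The data below are synthetic, and the rule linking Income to the target is a hypothetical construction chosen only so that one feature clearly dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic customers; only Income is related to the target by construction.
X = pd.DataFrame({
    "Income": rng.uniform(10, 200, n),
    "Age": rng.integers(23, 68, n),
    "Family": rng.integers(1, 5, n),
})
# Hypothetical rule: high income (plus noise) leads to accepting the loan.
y = (X["Income"] + rng.normal(0, 10, n) > 150).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Plotting this series as a bar chart would give a graph of the same kind as Figure 1.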
Among the models compared, the decision tree has the highest accuracy, followed by the random
forest classifier (Turkson, Baagyere & Wenya, 2016). The exact accuracy may vary depending on
the kernel and system specification. The accuracy of the logistic regression model was 94.6%,
which is high in general, so it can be considered a good model for the dataset.
Figure 2: Confusion matrix of Logistic Regression model
Figure 2 shows the confusion matrix of the logistic regression performed in the analysis.
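To illustrate how a confusion matrix is read, the sketch below builds one with scikit-learn from ten made-up labels and derives accuracy, precision and recall from its four cells. These labels are not the predictions behind Figure 2.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = accepted the loan, 0 = did not.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted acceptors who accepted
recall = tp / (tp + fn)           # share of actual acceptors who were found

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because few customers accept the loan, precision and recall on the positive class are worth reading alongside accuracy.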
The decision tree is regarded as one of the best and most widely used supervised learning
algorithms (Yeo & Grant, 2018). Tree-based models offer high accuracy, stability and ease of
interpretation because of their structure, and they can solve both classification and
regression problems.
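The ease of interpretation can be shown with a small sketch: scikit-learn's export_text prints the rules a fitted tree has learned. The six rows and the two feature names are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [Income, Education]; label 1 = accepted the loan.
X = [[30, 1], [45, 2], [60, 1], [120, 3], [150, 2], [180, 3]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules can be read directly as if/else conditions.
rules = export_text(tree, feature_names=["Income", "Education"])
print(rules)
```

For these rows a single split on Income separates the two classes, so the printed tree has one condition and two leaves.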

Conclusion
From the analysis it can be concluded that the random forest and decision tree models achieved
the highest accuracy, owing to their characteristics. The dataset contains many outliers, so it
is important to obtain a clean dataset in order to achieve higher prediction accuracy.
Different models need to be implemented to identify the best one; in this analysis most of the
models achieved an accuracy above 90%. In future, a larger volume of data needs to be generated
to support further findings.

References
Dey, A. (2016). Machine learning algorithms: a review. International Journal of Computer
Science and Information Technologies, 7(3), 1174-1179.
Eweoya, I. O., Adebiyi, A. A., Azeta, A. A., & Azeta, A. E. (2019, August). Fraud prediction in
bank loan administration using decision tree. In Journal of Physics: Conference Series
(Vol. 1299, No. 1, p. 012037). IOP Publishing.
Gasso, G. (2019). Logistic regression.
Humber, M. (2018). Personal Finance with Python.
Kumar, A. (2016). Learning predictive analytics with Python. Packt Publishing Ltd.
Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for
data scientists. O'Reilly Media, Inc.
Turkson, R. E., Baagyere, E. Y., & Wenya, G. E. (2016, September). A machine learning
approach for predicting bank credit worthiness. In 2016 Third International Conference
on Artificial Intelligence and Pattern Recognition (AIPR) (pp. 1-7). IEEE.
Yeo, B., & Grant, D. (2018). Predicting service industry performance using decision tree
analysis. International Journal of Information Management, 38(1), 288-300.