Data Science Project: Credit Card Application Analysis, Assessment 2
Added on 2022/09/24
AI Summary
This project analyzes a dataset of credit card applications and approval decisions using R programming. The project begins with importing the data from a publicly available source, handling variable types, and addressing missing values. It then calculates and visualizes proximity measurements using the daisy function and converts the Gower dissimilarity object into a distance matrix. The analysis includes visual exploration of data patterns and relationships using ggplot2, such as boxplots and bar plots, to understand the distributions of different variables (like account balance, monthly expenses, credit score, age, employment status, marital status, and banking institution) in relation to credit approval. Furthermore, the project calculates the Simple Matching Coefficient (SMC) to assess the similarity in approval rates based on prior defaults and employs a two-sample t-test to compare the monthly income between approved and declined applications. The findings reveal insights into factors influencing credit approval, such as prior default history, employment status, and income levels, along with potential data patterns and correlations.

Running head: ASSESSMENT 2 1
Tutorial Project – Assessment 2
Name:
Institution:
Table 1 Loading the data R code
Table 2 R code
Table 3 Use the following R code to manually define the variables
Table 4
Data <- read.table("~/crx.data", header = FALSE, na.strings = "?", sep = ",")
names(Data) <- c("Gender", "Age", "MonthlyExpenses", "MaritalStatus",
                 "HomeStatus", "Occupation", "BankingInstitution", "YearsEmployed",
                 "NoPriorDefault", "Employed", "CreditScore", "DriversLicense",
                 "AccountType", "MonthlyIncome", "AccountBalance", "Approved")
Data$Gender <- as.factor(Data$Gender)
Data$Age <- as.numeric(Data$Age)
Data$MonthlyExpenses <- as.integer(Data$MonthlyExpenses)
Data$MaritalStatus <- as.factor(Data$MaritalStatus)
Data$HomeStatus <- as.factor(Data$HomeStatus)
Data$Occupation <- as.factor(Data$Occupation)
Data$BankingInstitution <- as.factor(Data$BankingInstitution)
Data$YearsEmployed <- as.numeric(Data$YearsEmployed)
Data$NoPriorDefault <- as.factor(Data$NoPriorDefault)
Data$Employed <- as.factor(Data$Employed)
Data$CreditScore <- as.numeric(Data$CreditScore)
Data$DriversLicense <- as.factor(Data$DriversLicense)
Data$AccountType <- as.factor(Data$AccountType)
Data$MonthlyIncome <- as.integer(Data$MonthlyIncome)
Data$AccountBalance <- as.numeric(Data$AccountBalance)
Data$Approved <- as.factor(Data$Approved)
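The loading step can be tried without the data file. The sketch below is only an illustration: the three rows in `raw` are made-up values, not records from crx.data, and the column subset is an assumption chosen to keep the example short. It shows how `na.strings = "?"` turns question marks into NA, and that `as.integer()` drops the fractional part of a value, which is worth checking before applying it to a column that may hold decimals.

```r
# Made-up rows in the same comma-separated layout (NOT real records).
raw <- "b,30.83,0,+\na,?,4.46,-\nb,24.50,0.5,+"

toy <- read.table(text = raw, header = FALSE, sep = ",", na.strings = "?")
names(toy) <- c("Gender", "Age", "MonthlyExpenses", "Approved")

# Declare the categorical columns explicitly, as in the report's code.
toy$Gender   <- as.factor(toy$Gender)
toy$Approved <- as.factor(toy$Approved)

sapply(toy, class)     # column types after conversion
sum(is.na(toy$Age))    # the "?" was read as NA

# as.integer() truncates towards zero: 4.46 becomes 4 and 0.5 becomes 0.
truncated <- as.integer(toy$MonthlyExpenses)
truncated
```

If the fractional part matters for the analysis, `as.numeric()` would keep it.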
(c) Records with missing values
Table 5
Table 6
1. Calculating and visualising proximity measurements
Table 7
A male and a female applicant have equal chances of having the credit approved or declined.
Gender is therefore a symmetric binary variable, because neither level carries more weight
than the other. Holding a driving license, on the other hand, could have been used as a
criterion for approving the credit card, so applicants with a license would be more likely to
be approved than those without one. This makes DriversLicense an asymmetric binary variable.
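The distinction matters when dissimilarities are computed. As a hedged sketch, the `type` argument of `daisy()` from the cluster package accepts `symm` and `asymm` entries naming the binary columns. The toy data frame below is made up, and coding the two binary columns as 0/1 is an assumption of this example; the `daisy()` call used in this report does not set `type`.

```r
library(cluster)  # provides daisy()

# Made-up toy data; the binary columns are coded 0/1 for this sketch.
toy <- data.frame(
  Gender         = c(1, 0, 1, 0),
  DriversLicense = c(1, 1, 0, 0),
  Age            = c(30, 45, 22, 51)
)

# Gender is declared symmetric, DriversLicense asymmetric.
d <- daisy(toy, metric = "gower",
           type = list(symm = "Gender", asymm = "DriversLicense"))
m <- as.matrix(d)
m
```

For an asymmetric binary variable, a pair of records that both lack the attribute contributes nothing to the dissimilarity, whereas a symmetric variable counts such a pair as a match.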
Data <- na.omit(Data)
Before the NAs were removed, Gender had a length of 690; after the missing values were
removed, it had a length of 653. Thus, 37 records were incomplete. The data contained 67
NAs in total, so some records had more than one missing value.
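The counts above separate missing cells from incomplete records. A minimal sketch of how both can be obtained in base R follows; the data frame `toy` is made up for illustration and is not the credit data.

```r
# Made-up data with NAs spread over several columns.
toy <- data.frame(
  Age    = c(30.8, NA, 24.5, 27.8, NA),
  Income = c(202, 43, NA, 100, NA),
  Gender = factor(c("a", "b", "a", NA, "b"))
)

total_na        <- sum(is.na(toy))            # number of missing cells
incomplete_rows <- sum(!complete.cases(toy))  # rows with at least one NA
complete        <- na.omit(toy)               # keeps fully observed rows only

total_na
incomplete_rows
nrow(complete)
```

Row 5 has two missing cells, which is why the number of missing cells exceeds the number of incomplete rows.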
Table 8 Use the R code to convert the Gower dissimilarity object into a distance matrix
Table 9
Table 10 Use the following R code to visualise the distance matrix
## Visualising proximity measurements
library(cluster)                       # provides daisy()
Dist <- daisy(Data, metric = "gower")  # Gower dissimilarity for mixed-type data
Dist <- as.matrix(Dist)                # convert the dissimilarity object to a full matrix
Dist[10, 60]
# [1] 0.3962307
n <- ncol(Dist)  # used to define the axes in image(); avoids masking base::dim
image(1:n, 1:n, Dist, axes = FALSE, xlab = "", ylab = "", col = rainbow(100))
heatmap(Dist, Rowv = TRUE, Colv = "Rowv", symm = TRUE)
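To make a single entry such as `Dist[10, 60]` easier to interpret, the sketch below recomputes one Gower dissimilarity by hand on made-up data and compares it with `daisy()`. The helper `gower_pair()` is hypothetical and written only for this illustration. It assumes no missing values: numeric columns contribute the absolute difference divided by the column range, factor columns contribute 0 for a match and 1 for a mismatch, and the contributions are averaged.

```r
library(cluster)

# Made-up mixed-type data (not the credit data).
toy <- data.frame(
  Age      = c(20, 30, 40),
  Income   = c(100, 300, 200),
  Employed = factor(c("t", "f", "t"))
)

# Hypothetical helper: Gower dissimilarity between rows i and j.
gower_pair <- function(df, i, j) {
  parts <- sapply(names(df), function(v) {
    x <- df[[v]]
    if (is.numeric(x)) {
      abs(x[i] - x[j]) / diff(range(x))  # range-scaled difference
    } else {
      as.numeric(x[i] != x[j])           # simple mismatch
    }
  })
  mean(parts)
}

by_hand  <- gower_pair(toy, 1, 2)  # (0.5 + 1 + 1) / 3
by_daisy <- as.matrix(daisy(toy, metric = "gower"))[1, 2]
c(by_hand = by_hand, by_daisy = by_daisy)
```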
Table 11 Insert the image(s) of the distance matrix below, then Describe the pattern you see
when visualising it(them). Marks (1)
Table 12
2. Visually exploring data patterns and relationships
library(ggplot2)
The heat map shows that some data points cluster together. This indicates that there are
similarities among the people applying for credit, so records within the same cluster share
a similar profile.
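A visual impression of clusters can be checked numerically. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses made-up two-dimensional points, Euclidean `dist()` instead of the Gower matrix, and average-linkage `hclust()`, then cuts the tree into two groups.

```r
# Made-up points forming two well-separated groups.
pts <- data.frame(
  x = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
  y = c(1.0, 0.8, 1.1, 9.0, 9.2, 8.8)
)

d      <- dist(pts)                      # Euclidean distances
tree   <- hclust(d, method = "average")  # agglomerative clustering
groups <- cutree(tree, k = 2)            # cluster label per point
groups
```

With a Gower dissimilarity object, the same two calls could be applied by passing `as.dist()` of the distance matrix to `hclust()`.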
# Numeric columns: Age, MonthlyExpenses, YearsEmployed, MonthlyIncome, AccountBalance
Num.Data <- Data[, c(2, 3, 8, 14, 15)]
cor(Num.Data, method = "pearson")
cor(Num.Data, method = "spearman")
# Alternative approach: rcorr() also reports p-values
library(Hmisc)
rcorr(as.matrix(Num.Data), type = "pearson")
rcorr(as.matrix(Num.Data), type = "spearman")
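Both coefficients are computed because they react differently to extreme values. The following sketch uses made-up vectors with one large outlier: the relationship is perfectly monotone, so the rank-based Spearman coefficient equals 1, while the Pearson coefficient is pulled below 1.

```r
# Made-up vectors: y increases with x, and x has one extreme value.
x <- c(1, 2, 3, 4, 5, 100)
y <- c(2, 4, 5, 7, 9, 10)

pearson  <- cor(x, y, method = "pearson")   # sensitive to the outlier
spearman <- cor(x, y, method = "spearman")  # depends on ranks only

c(pearson = pearson, spearman = spearman)
```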
ggplot(Data, aes(x = AccountBalance, y = Approved)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
  labs(title = "Account Balance Distribution by Approval")
ggplot(Data, aes(x = MonthlyExpenses, y = Approved)) +
  geom_boxplot(outlier.colour = "blue", outlier.shape = 8, outlier.size = 4) +
  labs(title = "Monthly Expenses Distribution by Approval")
ggplot(Data, aes(x = CreditScore, y = Approved)) +
  geom_boxplot(outlier.colour = "green", outlier.shape = 4, outlier.size = 4) +
  labs(title = "Credit Score Distribution by Approval")
ggplot(Data, aes(x = Age, y = Approved)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 4, outlier.size = 4) +
  labs(title = "Age Distribution by Approval")
Table 14
Table 15
ggplot(Data, aes(Employed)) +
  geom_bar(aes(fill = Approved)) +
  labs(title = "A bar plot of Employed", x = "Employed")
ggplot(Data, aes(MaritalStatus)) +
  geom_bar(aes(fill = Approved)) +
  labs(title = "A bar plot of Marital Status", x = "Marital Status")
The account balance has many outliers on the right side. The box and whiskers are also
compressed close together, which indicates that the data are strongly right-skewed and
probably not normally distributed. Monthly expenses have outliers in both the approved and
the declined groups, and applicants who were approved had a higher median monthly expenditure.
Credit score and age show a similar trend: the approved group has a higher median credit
score and a higher median age. In all cases, the outliers lie on the right-hand side of the
plot.
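The medians and outliers read off the boxplots can also be obtained as numbers. The sketch below uses made-up balances, not the credit data. It assumes the usual boxplot rule, under which points more than 1.5 times the hinge spread beyond the hinges are flagged, as implemented in base R's `boxplot.stats()`.

```r
# Made-up balances for two approval groups.
toy <- data.frame(
  Approved = factor(rep(c("+", "-"), each = 7)),
  Balance  = c(10, 12, 14, 16, 18, 20, 500,  # "+" group, one extreme value
               1, 2, 3, 4, 5, 6, 7)          # "-" group
)

# Median balance per group.
medians <- tapply(toy$Balance, toy$Approved, median)

# Points flagged as outliers in the approved group.
outliers <- boxplot.stats(toy$Balance[toy$Approved == "+"])$out

medians
outliers
```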
ggplot(Data, aes(BankingInstitution)) +
  geom_bar(aes(fill = Approved)) +
  labs(title = "A bar plot of Banking Institution", x = "Banking Institution")
ggplot(Data, aes(NoPriorDefault)) +
  geom_bar(aes(fill = Approved)) +
  labs(title = "A bar plot of No Prior Default", x = "No Prior Default")
Table 16
Table 17
Applicants with no prior default had a higher chance of having their credit card approved,
while those with a prior default history were more likely to be declined. All banking
institutions showed a similar approval trend, with almost equal chances of approval or
decline. Applicants who were single (u) had almost equal chances of being approved or
declined, whereas those who were married had a lower chance of being approved. Employed
applicants were more likely to be approved, and unemployed applicants were less likely.
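Statements about higher or lower chances of approval can be backed with row proportions from a contingency table. The sketch below uses made-up records, and the level codes "t", "f", "+" and "-" are assumptions for illustration.

```r
# Made-up records (not the credit data).
toy <- data.frame(
  NoPriorDefault = factor(c("t", "t", "t", "f", "f", "f", "t", "f")),
  Approved       = factor(c("+", "+", "-", "-", "-", "+", "+", "-"))
)

counts <- table(toy$NoPriorDefault, toy$Approved)  # cross-tabulated counts
rates  <- prop.table(counts, margin = 1)           # proportions within each row

rates
```

Each row of `rates` sums to 1, so the "+" column gives the approval rate for that level of the row variable.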
#Simple Matching Coefficient (SMC)
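As a minimal sketch, the SMC between two binary variables is the share of records on which they agree, (f11 + f00) / n. The function `smc()` and both vectors below are hypothetical and made up for illustration, with 0/1 coding assumed.

```r
# Hypothetical helper: Simple Matching Coefficient of two equal-length vectors.
smc <- function(a, b) {
  stopifnot(length(a) == length(b))
  mean(a == b)  # matching positions (1-1 and 0-0) divided by n
}

# Made-up 0/1 vectors (not the credit data).
no_prior_default <- c(1, 1, 0, 0, 1, 0)
approved         <- c(1, 1, 0, 1, 1, 0)

result <- smc(no_prior_default, approved)
result  # 5 of the 6 positions agree
```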
Table 18
An assessment was carried out to determine the distribution of monthly income. The
histogram is illustrated below.
It shows that most applicants have a low monthly income (less than 500) and that there are
some extreme monthly incomes. A two-sample t-test was then carried out to determine whether
the average monthly income differed significantly between applicants who were approved and
those who were not. The results are:
        Two Sample t-test
data:  Data$MonthlyIncome[Data$Approved == "+"] and Data$MonthlyIncome[Data$Approved == "-"]
t = -2.1822, df = 651, p-value = 0.02945
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -54.690901  -2.883783
sample estimates:
mean of x mean of y
 164.6216  193.4090
The results suggest that the group with declined applications had a significantly higher
mean monthly income than the group that was approved (p = 0.029). This calls for further
analysis of whether the presence of the extreme values affected these results.
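One way to follow up on the extreme values is to compare the pooled t-test with tests that rely less on equal variances or on the mean. The sketch below uses simulated, made-up incomes rather than the credit data. The choice of Welch's t-test and the Wilcoxon rank-sum test as checks is an assumption of this example, not part of the original analysis.

```r
set.seed(1)  # reproducible made-up data

# Simulated right-skewed incomes; one extreme value is added to one group.
approved <- c(rexp(50, rate = 1 / 150), 5000)
declined <- rexp(50, rate = 1 / 190)

# Pooled-variance test, the form that reports df = n1 + n2 - 2.
pooled <- t.test(approved, declined, var.equal = TRUE)

# Welch's test does not assume equal variances.
welch <- t.test(approved, declined)

# The rank-based test is far less sensitive to a few extreme values.
rank_based <- wilcox.test(approved, declined)

c(pooled = pooled$p.value, welch = welch$p.value, wilcoxon = rank_based$p.value)
```

If the three p-values disagree markedly, the extreme observations are likely driving the comparison of means.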