Solution: INFO411 Data Mining Assignment 1, University of Wollongong

AI Summary
This document presents a solution to a data mining assignment on credit rating prediction using the provided dataset. The solution covers data import, cleaning (handling missing values and outliers), and exploratory analysis, applies k-means clustering for an initial look at the data, and then builds a logistic regression model to predict credit ratings, including feature selection and model evaluation. It reports the accuracy achieved, discusses strategies for maximising prediction accuracy (feature selection and algorithm tuning), and explains why 100% accuracy cannot be obtained on this test set and what would be needed to get closer to it.
## Task 2 ####
############################## Question 1 ##############################
# Import file creditworthiness.csv
credit_data <- read.csv("/Path of your folder/creditworthiness.csv")
View(credit_data)
names(credit_data) # Column names / variable names
dim(credit_data) # There are 2500 observations and 46 variables
str(credit_data)
summary(credit_data) # Summary of the dataset
##list of rows with missing values
credit_data[!complete.cases(credit_data),]
## List of columns with missing values
credit_data[, colSums(is.na(credit_data)) > 0]
## Discard rows with missing values
credit_data <- na.omit(credit_data)
# First check the complete set of components for outliers
boxplot(credit_data)
# Outlier visible in savings.on.other.accounts
boxplot(credit_data$savings.on.other.accounts)
## Define a helper function to replace outliers
## (data.table::set() modifies a data.table in place)
library(data.table)
outlierReplace = function(credit_data, cols, rows, newValue = NA) {
  if (any(rows)) {
    set(credit_data, rows, cols, newValue)
  }
}
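# A hypothetical usage sketch (not part of the original solution; left commented out so
# it does not alter the pipeline below). The column name and the 3-standard-deviation
# cutoff are assumptions, and data.table::set() only works on a data.table, so the
# data.frame would need converting first:
# setDT(credit_data)
# cutoff <- mean(credit_data$savings.on.other.accounts) +
#   3 * sd(credit_data$savings.on.other.accounts)
# outlierReplace(credit_data, "savings.on.other.accounts",
#                which(credit_data$savings.on.other.accounts > cutoff), NA)
# setDF(credit_data)   # back to a plain data.frame; the new NAs would then need handling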
# gender vs credit rating
counts <- table(credit_data$gender, credit_data$credit.rating)
barplot(counts, main="Credit Rating Distribution",
xlab="Categories", col=c("gray", "blue"), legend = rownames(counts),
beside=TRUE)
# Convert credit.rating into a categorical (factor) variable
credit_data$credit.rating = as.factor(credit_data$credit.rating)
### k-means clustering (k-means requires numeric inputs, so use the numeric columns only)
num_cols <- sapply(credit_data, is.numeric)
fit <- kmeans(credit_data[, num_cols], 6)
fit
## Checking withinss, i.e. the within-cluster sum of squares (cluster cohesion) for each cluster
fit$withinss
## Checking betweenss, i.e. the between-cluster sum of squares (separation between clusters)
fit$betweenss
library(cluster)
library(fpc)
clusplot(credit_data[, num_cols], fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
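# A quick sanity check on the choice of k (my own addition, not required by the brief):
# an "elbow" plot of total within-cluster sum of squares for k = 1..10, computed on the
# numeric columns only.
wss <- sapply(1:10, function(k) kmeans(credit_data[, num_cols], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")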
# Preparing the Dataset for Prediction
#install.packages("caTools")
library(caTools)
# Splitting the dataset in a ratio of 70% train data and 30% test data.
set.seed(100)
split_data = sample.split(credit_data$credit.rating, 0.7)
train = subset(credit_data, split_data == TRUE)
test = subset(credit_data, split_data == FALSE)
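# sample.split() stratifies on the outcome, so both subsets should keep roughly the same
# class proportions; a quick check:
round(prop.table(table(train$credit.rating)), 3)
round(prop.table(table(test$credit.rating)), 3)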
# Creating the Logistic Model and checking the model summary
modeltest = glm(credit.rating ~. -savings.on.other.accounts, data=train,
family=binomial(link = 'logit'))
summary(modeltest)
# After checking the significance level of the variables, fit a revised logistic regression model
best_model = glm(credit.rating ~
re.balanced..paid.back..a.recently.overdrawn.current.acount +
max..account.balance.7.months.ago + FI3O.credit.score + gender +
min..account.balance.12.months.ago, data=train, family="binomial")
summary(best_model)
# prediction on test data
predicted_value = predict(best_model, newdata=test, type="response", na.action =
na.pass)
# Density of predicted probabilities
library(ggplot2)
ggplot(data.frame(predicted_value) , aes(predicted_value)) +
geom_density(fill = 'lightblue' , alpha = 0.4) +
labs(x = 'Predicted Probabilities on test set')
# Measuring the accuracy of the model
table(as.numeric(predicted_value >= 0.5), test$credit.rating)
# Computing Accuracy of the Model
# Total accuracy = Total correct predictions / total predictions
Total_Accuracy = (62 + 144 + 281 + 118)/(99 + 1 + 10 + 35 + 62 + 144 + 281 +
118)
# Total_Accuracy = 0.8066667 = 80.67%
Error_rate = 1-Total_Accuracy
Error_rate
# Error_rate = 0.1933333
############################## Question 2 ##############################
# a.) Describe a valid strategy that maximises the accuracy of predicting the
#     credit rating. Explain why your strategy can be expected to maximize the
#     prediction capabilities.
# Ans: 1. Understanding the data well (types, distributions, missing values, outliers)
#         is the first requirement for good accuracy.
#      2. Feature selection, i.e. keeping only the variables that are informative about
#         the credit rating, also plays a vital role in the accuracy of the model.
#      3. Algorithm tuning, i.e. adjusting the parameters of the model, can then be used
#         to extract further performance.
#      4. This strategy can be expected to maximise the prediction capability because
#         removing noisy, uninformative variables and tuning the model reduce overfitting,
#         so the model generalises better to unseen data. A code sketch of points 2-3
#         follows below.
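# A sketch of points 2-3 above (one possible implementation, not the only valid one):
# backward stepwise selection by AIC drops uninformative predictors automatically,
# instead of reading p-values off summary() by hand.
step_model <- step(modeltest, direction = "backward", trace = FALSE)
summary(step_model)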
# b.) Use your strategy to train MLP(s) then report your results. Give an
#     interpretation of your results. What is the best classification accuracy
#     (expressed in % of correctly classified data) that you can obtain for data
#     that were not used during training (i.e. the test set)?
# Ans: Using logistic regression to model the categorical target variable, the best
#      classification accuracy obtained on the test set for this use case is 80.67%.
#      An MLP trained on the same split can be compared against this figure; a sketch
#      of how to fit one follows below.
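# A minimal sketch of how an MLP could be trained for part (b) with the nnet package;
# the size, decay and maxit values are assumptions that would need tuning (e.g. by
# cross-validation), so the resulting accuracy may differ from the figure above.
library(nnet)
set.seed(100)
mlp_fit <- nnet(credit.rating ~ ., data = train, size = 10, decay = 0.1,
                maxit = 500, trace = FALSE)
mlp_pred <- predict(mlp_fit, newdata = test, type = "class")
mean(mlp_pred == test$credit.rating)   # proportion correctly classified on the test set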
# c.) You will find that 100% accuracy cannot be obtained on the test data.
#     Explain reasons to why a 100% accuracy could not be obtained on this test
#     dataset. What would be needed to get the prediction accuracy closer to 100%?
# Ans: Because only a handful (around 5) of the 46 variables in the dataset are
#      statistically significant predictors of the credit rating; the remaining
#      variables contribute mostly noise, so some misclassification is unavoidable.
#      Getting closer to 100% would require more strongly predictive variables
#      (i.e. features with a much higher level of significance) in the data.
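# For context when judging why accuracy tops out well below 100%: the naive
# majority-class baseline accuracy on the test set is
max(prop.table(table(test$credit.rating)))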