Exploratory data analysis

The results below present the summary statistics for the variables. As can be seen, two variables (monthly income and number of dependents) had missing values, with 49,834 and 6,550 missing observations respectively.

> summary(credit_data)
    Revolving             Age             Worse30           DebtRatio           Income
 Min.   :    0.00   Min.   :  0.00   Min.   : 0.0000   Min.   :     0.0   Min.   :      0
 1st Qu.:    0.03   1st Qu.: 41.00   1st Qu.: 0.0000   1st Qu.:     0.2   1st Qu.:   3400
 Median :    0.15   Median : 52.00   Median : 0.0000   Median :     0.4   Median :   5400
 Mean   :    5.75   Mean   : 52.34   Mean   : 0.4343   Mean   :   349.6   Mean   :   6745
 3rd Qu.:    0.56   3rd Qu.: 63.00   3rd Qu.: 0.0000   3rd Qu.:     0.9   3rd Qu.:   8212
 Max.   :50708.00   Max.   :109.00   Max.   :98.0000   Max.   :329664.0   Max.   :7727000
                                                                          NA's   :  49834
     Loans           TimesLate      RealEstateLoans       Worse60          Dependents
 Min.   : 0.000   Min.   : 0.0000   Min.   :  0.000   Min.   : 0.0000   Min.   : 0.000
 1st Qu.: 5.000   1st Qu.: 0.0000   1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.: 0.000
 Median : 8.000   Median : 0.0000   Median :  1.000   Median : 0.0000   Median : 0.000
 Mean   : 8.453   Mean   : 0.2784   Mean   :  1.016   Mean   : 0.2525   Mean   : 0.762
 3rd Qu.:11.000   3rd Qu.: 0.0000   3rd Qu.:  2.000   3rd Qu.: 0.0000   3rd Qu.: 1.000
 Max.   :85.000   Max.   :98.0000   Max.   : 54.000   Max.   :98.0000   Max.   :43.000
                                                                        NA's   :  6550

The first set of histograms shows that only the age of the applicant is approximately normally distributed. The other three variables in that set are heavily skewed to the right (a longer tail to the right). The second set of histograms shows that all four of the variables presented there are heavily skewed to the right.
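The missing-value counts and the right skew described above can be checked directly in R. The following is a minimal sketch, assuming a small invented data frame `toy` as a stand-in for `credit_data`; its values are illustrative only and are not taken from the actual data.

```r
# Toy stand-in for credit_data; the values are invented for illustration.
toy <- data.frame(
  Age        = c(25, 41, 52, 63, 70, 48),
  Income     = c(3400, 5400, NA, 8212, 120000, NA),
  Dependents = c(0, 1, 0, NA, 2, 0)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# A mean well above the median is a quick sign of a long right tail.
income_mean   <- mean(toy$Income, na.rm = TRUE)
income_median <- median(toy$Income, na.rm = TRUE)
right_skewed  <- income_mean > income_median
print(right_skewed)
```

Comparing the mean with the median is only a rough indicator; a histogram, as used in the report, remains the more informative check.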
Checking for outliers

Part of the exploratory data analysis entailed checking for outliers. As the first set of boxplots shows, all four variables contained outliers. Outliers were also present in the next set of four variables shown in the second set of boxplots.
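The points that a default R boxplot draws individually are those lying more than 1.5 times the interquartile range beyond the box. A minimal sketch of that rule, assuming an invented vector `x` of debt-ratio-like values (not taken from the actual data):

```r
# Invented values; the last two are deliberately extreme.
x <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 35, 3296)

# Points a default boxplot would flag (beyond 1.5 * IQR from the hinges).
outliers <- boxplot.stats(x)$out

# The same idea written out with quantiles for the upper fence.
q           <- quantile(x, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
above_fence <- x[x > upper_fence]

print(outliers)
print(above_fence)
```

For heavily right-skewed variables such as those in this data set, this rule tends to flag a large share of the upper tail, so flagged points are candidates for inspection rather than automatic removal.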
Credit scoring model

In this section, we present the results of the credit-scoring model in which SeriousDlqin2yrs is used as the target (default). The method employed is logistic regression, since the outcome variable is binary: each applicant either defaults on the loan or does not. The first model included all the explanatory variables. Two of them (Revolving Utilization of Unsecured Lines and Number of Open Credit Lines and Loans) were not significant at the 5% level and were therefore dropped.

> summary(model)

Call:
glm(formula = Deliquence ~ ., family = binomial(), data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1639  -0.3988  -0.3268  -0.2647   5.1913

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.472e+00  4.630e-02 -31.801  < 2e-16 ***
Revolving       -5.704e-05  7.700e-05  -0.741  0.45885
Age             -2.517e-02  9.290e-04 -27.093  < 2e-16 ***
Worse30          5.007e-01  1.193e-02  41.977  < 2e-16 ***
DebtRatio       -1.514e-04  4.887e-05  -3.098  0.00195 **
Income          -4.214e-05  3.290e-06 -12.809  < 2e-16 ***
Loans           -5.086e-03  2.727e-03  -1.865  0.06220 .
TimesLate        4.426e-01  1.679e-02  26.365  < 2e-16 ***
RealEstateLoans  8.428e-02  1.111e-02   7.587 3.28e-14 ***
Worse60         -9.067e-01  1.942e-02 -46.699  < 2e-16 ***
Dependents       1.035e-01  9.667e-03  10.704  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 60689  on 120268  degrees of freedom
Residual deviance: 56019  on 120258  degrees of freedom
  (29731 observations deleted due to missingness)
AIC: 56041

Number of Fisher Scoring iterations: 6

The second model includes only the significant variables. The results are presented below.

> summary(model2)

Call:
glm(formula = Deliquence ~ Age + Worse30 + DebtRatio + Income +
    TimesLate + RealEstateLoans + Worse60 + Dependents, family = "binomial",
    data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1739  -0.3989  -0.3273  -0.2648   5.2252

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.487e+00  4.574e-02 -32.499  < 2e-16 ***
Age             -2.550e-02  9.127e-04 -27.940  < 2e-16 ***
Worse30          4.979e-01  1.183e-02  42.098  < 2e-16 ***
DebtRatio       -1.535e-04  4.892e-05  -3.138   0.0017 **
Income          -4.295e-05  3.272e-06 -13.127  < 2e-16 ***
TimesLate        4.461e-01  1.671e-02  26.699  < 2e-16 ***
RealEstateLoans  7.666e-02  1.033e-02   7.418 1.19e-13 ***
Worse60         -9.071e-01  1.943e-02 -46.677  < 2e-16 ***
Dependents       1.034e-01  9.669e-03  10.696  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 60689  on 120268  degrees of freedom
Residual deviance: 56023  on 120260  degrees of freedom
  (29731 observations deleted due to missingness)
AIC: 56041

Number of Fisher Scoring iterations: 6

From the above results, four of the variables (age, debt ratio, monthly income and Number of Time 60-89 Days Past Due Not Worse) have negative, significant coefficients and therefore a negative effect on the likelihood of default. That is, holding the other variables constant, a unit increase in any one of these four variables is associated with a decrease in the chances of default. The other four variables (Number of Time 30-59 Days Past Due Not Worse, Number of Times 90 Days Late, Number Real Estate Loans or Lines and Number of Dependents) have positive, significant coefficients.
That is, holding the other variables constant, a unit increase in any one of these four variables is associated with an increase in the chances of default.

Model performance

The overall accuracy of the model was found to be 67.28%. This is not a very good level of accuracy.
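An overall accuracy figure of this kind is typically computed from a confusion matrix of actual against predicted classes. The following is a minimal sketch on simulated data; the data frame `sim`, its coefficients, and the 0.5 probability cutoff are all assumptions made for illustration, since the report does not state how its accuracy was obtained.

```r
set.seed(1)

# Simulated stand-in for the training data; none of these values come from the report.
n   <- 2000
sim <- data.frame(Age = runif(n, 21, 90), TimesLate = rpois(n, 0.3))
p   <- plogis(-1.5 - 0.025 * sim$Age + 0.45 * sim$TimesLate)
sim$Default <- rbinom(n, 1, p)

fit <- glm(Default ~ Age + TimesLate, family = binomial(), data = sim)

# Classify as default when the fitted probability exceeds the cutoff.
# The 0.5 cutoff is an assumption, not the report's documented choice.
prob <- predict(fit, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = factor(sim$Default, levels = c(0, 1)), Predicted = pred)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of actual defaults caught
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of non-defaults correctly passed

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

When defaults are rare, accuracy alone can be misleading, so reporting sensitivity and specificity alongside it gives a fuller picture of how the model performs on each class.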