Data set name:
Credit Card Default Data
Description:
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
Usage:
Default
Format:
A data frame with 10000 observations on the following 4 variables.
default
A factor with levels No and Yes indicating whether the customer defaulted on their debt.
balance
The average balance that the customer has remaining on their credit card after making their monthly payment.
income
Income of customer.
Source:
Simulated data
References:
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, http://www.StatLearning.com Springer-Verlag, New York
Examples:
summary(Default)
glm(default~student+balance+income,family=“binomial”,data=Default)
Data download page:
http://vincentarelbundock.github.io/Rdatasets/datasets.html
knitr::opts_chunk$set(echo = TRUE)
library(readr)
input_data <- read_csv("Default.csv", col_types = cols(X1 = col_skip(),
balance = col_number(), default = col_character(),
income = col_number(), student = col_character()))
## Warning: Missing column names filled in: 'X1' [1]
dim(input_data)
## [1] 10000 4
str(input_data)
## tibble [10,000 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ default: chr [1:10000] "No" "No" "No" "No" ...
## $ student: chr [1:10000] "No" "Yes" "No" "No" ...
## $ balance: num [1:10000] 730 817 1074 529 786 ...
## $ income : num [1:10000] 44362 12106 31767 35704 38463 ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_skip(),
## .. default = col_character(),
## .. student = col_character(),
## .. balance = col_number(),
## .. income = col_number()
## .. )
summary(input_data)
## default student balance income
## Length:10000 Length:10000 Min. : 0.0 Min. : 772
## Class :character Class :character 1st Qu.: 481.7 1st Qu.:21340
## Mode :character Mode :character Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
head(input_data)
## # A tibble: 6 x 4
## default student balance income
## <chr> <chr> <dbl> <dbl>
## 1 No No 730. 44362.
## 2 No Yes 817. 12106.
## 3 No No 1074. 31767.
## 4 No No 529. 35704.
## 5 No No 786. 38463.
## 6 No Yes 920. 7492.
tail(input_data)
## # A tibble: 6 x 4
## default student balance income
## <chr> <chr> <dbl> <dbl>
## 1 No Yes 172. 14956.
## 2 No No 712. 52992.
## 3 No No 758. 19661.
## 4 No No 845. 58636.
## 5 No No 1569. 36669.
## 6 No Yes 201. 16863.
sapply(input_data,mode)
## default student balance income
## "character" "character" "numeric" "numeric"
lapply(input_data[,c(3,4)],mean)
## $balance
## [1] 835.3749
##
## $income
## [1] 33516.98
lapply(input_data[,c(3,4)],median)
## $balance
## [1] 823.637
##
## $income
## [1] 34552.64
require(modeest)
## Loading required package: modeest
lapply(input_data[,3],mfv)
## $balance
## [1] 0
# Variable $Income has no mode. Each value repeats only once.
lapply(input_data[,c(3,4)],min)
## $balance
## [1] 0
##
## $income
## [1] 771.9677
lapply(input_data[,c(3,4)],max)
## $balance
## [1] 2654.323
##
## $income
## [1] 73554.23
lapply(input_data[,c(3,4)],range)
## $balance
## [1] 0.000 2654.323
##
## $income
## [1] 771.9677 73554.2335
lapply(input_data[,c(3,4)],var)
## $balance
## [1] 233980.2
##
## $income
## [1] 177865955
lapply(input_data[,c(3,4)],sd)
## $balance
## [1] 483.715
##
## $income
## [1] 13336.64
lapply(input_data[,c(3,4)],mad)
## $balance
## [1] 507.5173
##
## $income
## [1] 16350.86
These box plots reveal the mean value, minimum value, and maximum value of each variable.
Customers with higher average income are less likely to default on their credit cards. Customers with lower average balance are less likely to default on their credit cards. Students have higher average balances than non-student customers. Students have lower average incomes than non-student customers.
library(ggplot2)
ggplot(data = input_data, aes(x=, y=balance, fill=default)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Balance") + ggtitle("Balance by Default Status")
library(ggplot2)
ggplot(data = input_data, aes(x=, y=income, fill=default)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Income") + ggtitle("Income by Default Status")
ggplot(data = input_data, aes(x=, y=balance, fill=student)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Balance") + ggtitle("Balance by Student Status")
library(ggplot2)
ggplot(data = input_data, aes(x=, y=income, fill=student)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Income") + ggtitle("Income by Student Status")
These histogram plots reveal the relationships between customer student status, default status, income and balance.
Majority of customers who default on their credit cards have balance of about $1,800. Majority of customers who default on their credit cards have income of about $19,000. Majority of the customers are non-students. Students have much lower balances than non-student customers. Students have much lower incomes than non-student customers.
library(ggplot2)
ggplot(data = input_data, aes(x=balance, fill=default, color=default)) + geom_histogram(alpha=0.6)+ xlab("Balance") + ylab("Customer Count") + ggtitle("Balance by Default Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = input_data, aes(x=income, fill=default, color=default)) + geom_histogram(alpha=0.6) + xlab("Income") + ylab("Customer Count") + ggtitle("Income by Default Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = input_data, aes(x=balance, fill=student, color=student)) + geom_histogram(alpha=0.6) + xlab("Balance") + ylab("Customer Count") + ggtitle("Balance by Student Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = input_data, aes(x=income, fill=student, color=student)) + geom_histogram(alpha=0.6) + xlab("Income") + ylab("Customer Count") + ggtitle("Income by Student Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The correlation statistics reveal the degree of associations between variables in the data set.
It appears that there is not much association between independent variables balance and income. A correlation coefficient value of -0.15 and a p-value of 2.2e-16 indicate a very weak correlation between the two variables.
cor(input_data[,c(3,4)])
## balance income
## balance 1.0000000 -0.1522434
## income -0.1522434 1.0000000
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
attach(input_data[,c(3,4)])
cor.test(balance,income)
##
## Pearson's product-moment correlation
##
## data: balance and income
## t = -15.402, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1713322 -0.1330403
## sample estimates:
## cor
## -0.1522434
Variables default and student are categorical character variables. Since in order for R’s logistic regression to work correctly, all categorical variables much be explicitly converted into factors.
input_data$default = as.factor(input_data$default)
input_data$student = as.factor(input_data$student)
str(input_data)
## tibble [10,000 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num [1:10000] 730 817 1074 529 786 ...
## $ income : num [1:10000] 44362 12106 31767 35704 38463 ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_skip(),
## .. default = col_character(),
## .. student = col_character(),
## .. balance = col_number(),
## .. income = col_number()
## .. )
xtabs(~default + student, data=input_data)
## student
## default No Yes
## No 6850 2817
## Yes 206 127
This is the first stepwise iteration of the regression model. This step will reveal the model statistics such as the R-squared value, the Akaike information criterion (AIC), and the P-value of the logistic regression. The R-squared value indicates the percentage of the actual data points that could be explained by the regression. The AIC value is a maximum likelihood estimator for the logistic regression model. The smaller the AIC value the better fitted the model. The p-value of each independent variable indicates the pairwise significance of the association between each independent variable and the dependent variable.
A McFadden R-squared value of 0.46 means that the regression can explain 46% of the actual data points. Given a range between 0 and 1, an R-Squared value of 0.46 indicates a moderately fitted regression model. Given a cut-off point of 0.05, the p-value 1 of independent variable income is not significant at all. However, the p-value 0.0 of the independent variables “balance” is very significant. The p-value 0.001 of the dummy variable “studentYes” is moderately significant. Also, given a variance inflation factor cut-off point of 5, all three independent variables, student, balance, and income have very little multicollinearity. It is because they all have variance inflation factor values smaller than 3. The AIC value of 1579.5 will be compared with the AIC value in the next stepwise iteration to find the best-fitted regression model.
rg_output <- glm(default~., data=input_data, family=binomial(link="logit"))
summary(rg_output)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"),
## data = input_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
#log-likelihood of the null model
ll.null <- rg_output$null.deviance/-2
#log-likelihood of the fancy model
ll.proposed <- rg_output$deviance/-2
r2 <- (ll.null-ll.proposed)/ll.null
print(r2)
## [1] 0.4619194
pv <- 1-pchisq(2*(ll.proposed-ll.null),df=(length(rg_output$coefficients)-1))
print(pv)
## [1] 0
library(car)
## Loading required package: carData
# Variance Inflation Factors (>5?)
vif(rg_output)
## student balance income
## 2.761184 1.076946 2.674562
This is the second stepwise iteration of the regression model. This step will reveal the best fitted model.
A McFadden R-squared value of 0.46 means that the regression can explain 46% of the actual data points. Given a range between 0 and 1, an R-Squared value of 0.46 indicates a moderately fitted regression model. In order to have a better fitted regression model, it is suggested that more data records are required. 10,000 data records are not enough. Given a cut-off point of 0.05, both the p-values 0.0 of the independent variable “balance” and dummy variable “studentYes” are very significant. Given a variance inflation factor cut-off point of 5, both independent variables, balance, and student have no multicollinearity. It is because they both have variance inflation factor values close to 1. The AIC value of 1577.7 is smaller than that of the previous stepwise iteration, which means a better-fitted regression model.
rg_output = glm(default~balance+student, data=input_data, family=binomial(link="logit"))
summary(rg_output)
##
## Call:
## glm(formula = default ~ balance + student, family = binomial(link = "logit"),
## data = input_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
#log-likelihood of the null model
ll.null <- rg_output$null.deviance/-2
#log-likelihood of the fancy model
ll.proposed <- rg_output$deviance/-2
r2 <- (ll.null-ll.proposed)/ll.null
print(r2)
## [1] 0.4618726
pv <- 1-pchisq(2*(ll.proposed-ll.null),df=(length(rg_output$coefficients)-1))
print(pv)
## [1] 0
library(car)
# Variance Inflation Factors (>5?)
vif(rg_output)
## balance student
## 1.076697 1.076697
This is the automation of stepwise iterations.
The resulting suggested model reveals the same statistics as that from the second manual stepwise iteration above. Therefore, It is evident that the suggested model is the most parsimonious logistic regression model for this data frame.
m_empty = glm(default~1, data=input_data, family=binomial(link="logit"))
m_all = glm(default~., data=input_data, family=binomial(link="logit"))
step(m_empty, direction = "both", scope=formula(m_all))
## Start: AIC=2922.65
## default ~ 1
##
## Df Deviance AIC
## + balance 1 1596.5 1600.5
## + student 1 2908.7 2912.7
## + income 1 2916.7 2920.7
## <none> 2920.7 2922.7
##
## Step: AIC=1600.45
## default ~ balance
##
## Df Deviance AIC
## + student 1 1571.7 1577.7
## + income 1 1579.0 1585.0
## <none> 1596.5 1600.5
## - balance 1 2920.7 2922.7
##
## Step: AIC=1577.68
## default ~ balance + student
##
## Df Deviance AIC
## <none> 1571.7 1577.7
## + income 1 1571.5 1579.5
## - student 1 1596.5 1600.5
## - balance 1 2908.7 2912.7
##
## Call: glm(formula = default ~ balance + student, family = binomial(link = "logit"),
## data = input_data)
##
## Coefficients:
## (Intercept) balance studentYes
## -10.749496 0.005738 -0.714878
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9997 Residual
## Null Deviance: 2921
## Residual Deviance: 1572 AIC: 1578
Y = ln(p/(1-p)) = b0 + b1(x1) + b2(x2)
Y = -10.749496 + 0.005738(x1) - 0.714878(x2)
p = default = (e^(bo+b1(x1)+b2(x2))) / (e^(bo+b1(x1)+b2(x2))+1)
b0 = Y-intercept = -10.749496
b1 = Coefficient 1 = 0.005738
b2 = Coefficient 2 = -0.714878
x1 = balance
x2 = studentYes
The Y-intercept or Coefficient b0 for this equation is -10.749496. If independent variables X1 and X2 all equal to zeros, then the log-odds Y is equal to -10.749496. Coefficient b1 has a positive sign, which means that for any increase in the value of X1, there will be a corresponding increase in the value of log-odds Y. Coefficient b2 has a negative sign, which means that for any increase in the value of X2, there will be a corresponding decrease in the value of log-odds Y. If compared between b1 and b2, Coefficient b2 is the largest in magnitude. It is almost 125 times the magnitude of b1. This means if X1 is constant, an increase in X2 by 1 will correspond to a decrease in the value of log-odds Y by 0.714878. In terms of the probability p, if x2 is 0 and x1 is increased by 1900, p is increased by e^(bo+b1(x1)+b2(x2)) / (e^(bo+b1(x1)+b2(x2))+1) = e^(-10.749496+0.005738(1900)) / (e^(-10.749496+0.005738(1900))+1) = 0.538101988620444 or 54%. However, if x2 is 1 and x1 is increased by 1900, p is increased by e^(bo+b1(x1)+b2(x2)) / (e^(bo+b1(x1)+b2(x2))+1) = e^(-10.749496+0.005738(1900)-0.714878(1) ) / (e^(-10.749496+0.005738(1900)-0.714878(1))+1) = 0.363044587441418 or 36%. Therefore, for the same increase in balance, a non-student customer is more likely to default than a student customer.