Specific requirements

In this programming project, complete the following steps:

Data frame information

Data set name:
Credit Card Default Data
Description:
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
Usage:
Default
Format:
A data frame with 10000 observations on the following 4 variables.
default
          A factor with levels No and Yes indicating whether the customer defaulted on their debt.
balance
          The average balance that the customer has remaining on their credit card after making their monthly payment.
income
          Income of customer.
Source:
Simulated data
References:
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, http://www.StatLearning.com Springer-Verlag, New York
Examples:
summary(Default)
glm(default~student+balance+income,family=“binomial”,data=Default)
Data download page:
http://vincentarelbundock.github.io/Rdatasets/datasets.html

Include the knitr package for integration of R code into Markdown

knitr::opts_chunk$set(echo = TRUE)

Import data

library(readr)
input_data <- read_csv("Default.csv", col_types = cols(X1 = col_skip(), 
    balance = col_number(), default = col_character(), 
    income = col_number(), student = col_character()))
## Warning: Missing column names filled in: 'X1' [1]

Descriptive statistics

Dimension of data frame

dim(input_data)
## [1] 10000     4

Structure of data frame

str(input_data)
## tibble [10,000 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ default: chr [1:10000] "No" "No" "No" "No" ...
##  $ student: chr [1:10000] "No" "Yes" "No" "No" ...
##  $ balance: num [1:10000] 730 817 1074 529 786 ...
##  $ income : num [1:10000] 44362 12106 31767 35704 38463 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_skip(),
##   ..   default = col_character(),
##   ..   student = col_character(),
##   ..   balance = col_number(),
##   ..   income = col_number()
##   .. )

Summary statistics of data frame

summary(input_data)
##    default            student             balance           income     
##  Length:10000       Length:10000       Min.   :   0.0   Min.   :  772  
##  Class :character   Class :character   1st Qu.: 481.7   1st Qu.:21340  
##  Mode  :character   Mode  :character   Median : 823.6   Median :34553  
##                                        Mean   : 835.4   Mean   :33517  
##                                        3rd Qu.:1166.3   3rd Qu.:43808  
##                                        Max.   :2654.3   Max.   :73554

Head of data frame

head(input_data)
## # A tibble: 6 x 4
##   default student balance income
##   <chr>   <chr>     <dbl>  <dbl>
## 1 No      No         730. 44362.
## 2 No      Yes        817. 12106.
## 3 No      No        1074. 31767.
## 4 No      No         529. 35704.
## 5 No      No         786. 38463.
## 6 No      Yes        920.  7492.

Tail of data frame

tail(input_data)
## # A tibble: 6 x 4
##   default student balance income
##   <chr>   <chr>     <dbl>  <dbl>
## 1 No      Yes        172. 14956.
## 2 No      No         712. 52992.
## 3 No      No         758. 19661.
## 4 No      No         845. 58636.
## 5 No      No        1569. 36669.
## 6 No      Yes        201. 16863.

Variable data types

sapply(input_data,mode)
##     default     student     balance      income 
## "character" "character"   "numeric"   "numeric"

Mean of each variable

lapply(input_data[,c(3,4)],mean)
## $balance
## [1] 835.3749
## 
## $income
## [1] 33516.98

Median of each variable

lapply(input_data[,c(3,4)],median)
## $balance
## [1] 823.637
## 
## $income
## [1] 34552.64

Mode of each variable

require(modeest)
## Loading required package: modeest
lapply(input_data[,3],mfv)
## $balance
## [1] 0
# Variable $Income has no mode. Each value repeats only once.

Minimum value of each variable

lapply(input_data[,c(3,4)],min)
## $balance
## [1] 0
## 
## $income
## [1] 771.9677

Maximum value of each variable

lapply(input_data[,c(3,4)],max)
## $balance
## [1] 2654.323
## 
## $income
## [1] 73554.23

Range of each variable

lapply(input_data[,c(3,4)],range)
## $balance
## [1]    0.000 2654.323
## 
## $income
## [1]   771.9677 73554.2335

Variance of each variable

lapply(input_data[,c(3,4)],var)
## $balance
## [1] 233980.2
## 
## $income
## [1] 177865955

Standard deviation of each variable

lapply(input_data[,c(3,4)],sd)
## $balance
## [1] 483.715
## 
## $income
## [1] 13336.64

Median absolute deviation of each variable

lapply(input_data[,c(3,4)],mad)
## $balance
## [1] 507.5173
## 
## $income
## [1] 16350.86

Plots and Graphs

Box plots

Explanation:

These box plots reveal the mean value, minimum value, and maximum value of each variable.

Observation:

Customers with higher average income are less likely to default on their credit cards. Customers with lower average balance are less likely to default on their credit cards. Students have higher average balances than non-student customers. Students have lower average incomes than non-student customers.

library(ggplot2)
ggplot(data = input_data, aes(x=, y=balance, fill=default)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Balance") + ggtitle("Balance by Default Status")

library(ggplot2)
ggplot(data = input_data, aes(x=, y=income, fill=default)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Income") + ggtitle("Income by Default Status")

ggplot(data = input_data, aes(x=, y=balance, fill=student)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Balance") + ggtitle("Balance by Student Status")

library(ggplot2)
ggplot(data = input_data, aes(x=, y=income, fill=student)) + geom_boxplot(outlier.colour = "red") + xlab("") + ylab("Income") + ggtitle("Income by Student Status")

Histograms

Explanation:

These histogram plots reveal the relationships between customer student status, default status, income and balance.

Observation:

Majority of customers who default on their credit cards have balance of about $1,800. Majority of customers who default on their credit cards have income of about $19,000. Majority of the customers are non-students. Students have much lower balances than non-student customers. Students have much lower incomes than non-student customers.

library(ggplot2)
ggplot(data = input_data, aes(x=balance, fill=default, color=default)) + geom_histogram(alpha=0.6)+ xlab("Balance") + ylab("Customer Count") + ggtitle("Balance by Default Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = input_data, aes(x=income, fill=default, color=default)) + geom_histogram(alpha=0.6) + xlab("Income") + ylab("Customer Count") + ggtitle("Income by Default Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = input_data, aes(x=balance, fill=student, color=student)) + geom_histogram(alpha=0.6) + xlab("Balance") + ylab("Customer Count") + ggtitle("Balance by Student Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = input_data, aes(x=income, fill=student, color=student)) + geom_histogram(alpha=0.6) + xlab("Income") + ylab("Customer Count") + ggtitle("Income by Student Status")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Regression Model Fitting

Correlation Statistics

Explanation:

The correlation statistics reveal the degree of associations between variables in the data set.

Observation:

It appears that there is not much association between independent variables balance and income. A correlation coefficient value of -0.15 and a p-value of 2.2e-16 indicate a very weak correlation between the two variables.

Coefficient R:

cor(input_data[,c(3,4)])
##            balance     income
## balance  1.0000000 -0.1522434
## income  -0.1522434  1.0000000

P-Value:

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
attach(input_data[,c(3,4)])
cor.test(balance,income)
## 
##  Pearson's product-moment correlation
## 
## data:  balance and income
## t = -15.402, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1713322 -0.1330403
## sample estimates:
##        cor 
## -0.1522434

Converting categorical variables into factors

Explanation:

Variables default and student are categorical character variables. Since in order for R’s logistic regression to work correctly, all categorical variables much be explicitly converted into factors.

input_data$default = as.factor(input_data$default)
input_data$student = as.factor(input_data$student)
str(input_data)
## tibble [10,000 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num [1:10000] 730 817 1074 529 786 ...
##  $ income : num [1:10000] 44362 12106 31767 35704 38463 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_skip(),
##   ..   default = col_character(),
##   ..   student = col_character(),
##   ..   balance = col_number(),
##   ..   income = col_number()
##   .. )

Relationship between categorical variables:

xtabs(~default + student, data=input_data)
##        student
## default   No  Yes
##     No  6850 2817
##     Yes  206  127

1st Stepwise Regression Model Iteration

Explanation:

This is the first stepwise iteration of the regression model. This step will reveal the model statistics such as the R-squared value, the Akaike information criterion (AIC), and the P-value of the logistic regression. The R-squared value indicates the percentage of the actual data points that could be explained by the regression. The AIC value is a maximum likelihood estimator for the logistic regression model. The smaller the AIC value the better fitted the model. The p-value of each independent variable indicates the pairwise significance of the association between each independent variable and the dependent variable.

Observation:

A McFadden R-squared value of 0.46 means that the regression can explain 46% of the actual data points. Given a range between 0 and 1, an R-Squared value of 0.46 indicates a moderately fitted regression model. Given a cut-off point of 0.05, the p-value 1 of independent variable income is not significant at all. However, the p-value 0.0 of the independent variables “balance” is very significant. The p-value 0.001 of the dummy variable “studentYes” is moderately significant. Also, given a variance inflation factor cut-off point of 5, all three independent variables, student, balance, and income have very little multicollinearity. It is because they all have variance inflation factor values smaller than 3. The AIC value of 1579.5 will be compared with the AIC value in the next stepwise iteration to find the best-fitted regression model.

Stepwise:

rg_output <- glm(default~., data=input_data, family=binomial(link="logit"))
summary(rg_output)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = input_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

McFadden R-Squared:

#log-likelihood of the null model
ll.null <- rg_output$null.deviance/-2
#log-likelihood of the fancy model
ll.proposed <- rg_output$deviance/-2
r2 <- (ll.null-ll.proposed)/ll.null
print(r2)
## [1] 0.4619194

Model P-Value:

pv <- 1-pchisq(2*(ll.proposed-ll.null),df=(length(rg_output$coefficients)-1))
print(pv)
## [1] 0

Multicollinearity and Stepwise:

library(car)
## Loading required package: carData
# Variance Inflation Factors (>5?)
vif(rg_output)
##  student  balance   income 
## 2.761184 1.076946 2.674562

2nd Stepwise Regression Model Iteration

Explanation:

This is the second stepwise iteration of the regression model. This step will reveal the best fitted model.

Observation:

A McFadden R-squared value of 0.46 means that the regression can explain 46% of the actual data points. Given a range between 0 and 1, an R-Squared value of 0.46 indicates a moderately fitted regression model. In order to have a better fitted regression model, it is suggested that more data records are required. 10,000 data records are not enough. Given a cut-off point of 0.05, both the p-values 0.0 of the independent variable “balance” and dummy variable “studentYes” are very significant. Given a variance inflation factor cut-off point of 5, both independent variables, balance, and student have no multicollinearity. It is because they both have variance inflation factor values close to 1. The AIC value of 1577.7 is smaller than that of the previous stepwise iteration, which means a better-fitted regression model.

Stepwise:

rg_output = glm(default~balance+student, data=input_data, family=binomial(link="logit"))
summary(rg_output)
## 
## Call:
## glm(formula = default ~ balance + student, family = binomial(link = "logit"), 
##     data = input_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

McFadden R-Squared:

#log-likelihood of the null model
ll.null <- rg_output$null.deviance/-2
#log-likelihood of the fancy model
ll.proposed <- rg_output$deviance/-2
r2 <- (ll.null-ll.proposed)/ll.null
print(r2)
## [1] 0.4618726

Model P-Value:

pv <- 1-pchisq(2*(ll.proposed-ll.null),df=(length(rg_output$coefficients)-1))
print(pv)
## [1] 0

Multicollinearity:

library(car)
# Variance Inflation Factors (>5?)
vif(rg_output)
##  balance  student 
## 1.076697 1.076697

Automatic Stepwise Testing

Explanation:

This is the automation of stepwise iterations.

Observation:

The resulting suggested model reveals the same statistics as that from the second manual stepwise iteration above. Therefore, It is evident that the suggested model is the most parsimonious logistic regression model for this data frame.

m_empty = glm(default~1, data=input_data, family=binomial(link="logit"))
m_all = glm(default~., data=input_data, family=binomial(link="logit"))
step(m_empty, direction = "both", scope=formula(m_all))
## Start:  AIC=2922.65
## default ~ 1
## 
##           Df Deviance    AIC
## + balance  1   1596.5 1600.5
## + student  1   2908.7 2912.7
## + income   1   2916.7 2920.7
## <none>         2920.7 2922.7
## 
## Step:  AIC=1600.45
## default ~ balance
## 
##           Df Deviance    AIC
## + student  1   1571.7 1577.7
## + income   1   1579.0 1585.0
## <none>         1596.5 1600.5
## - balance  1   2920.7 2922.7
## 
## Step:  AIC=1577.68
## default ~ balance + student
## 
##           Df Deviance    AIC
## <none>         1571.7 1577.7
## + income   1   1571.5 1579.5
## - student  1   1596.5 1600.5
## - balance  1   2908.7 2912.7
## 
## Call:  glm(formula = default ~ balance + student, family = binomial(link = "logit"), 
##     data = input_data)
## 
## Coefficients:
## (Intercept)      balance   studentYes  
##  -10.749496     0.005738    -0.714878  
## 
## Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
## Null Deviance:       2921 
## Residual Deviance: 1572  AIC: 1578

Final Regression Equation:

Y = ln(p/(1-p)) = b0 + b1(x1) + b2(x2)
Y = -10.749496 + 0.005738(x1) - 0.714878(x2)
p = default = (e^(bo+b1(x1)+b2(x2))) / (e^(bo+b1(x1)+b2(x2))+1)
b0 = Y-intercept = -10.749496
b1 = Coefficient 1 = 0.005738
b2 = Coefficient 2 = -0.714878
x1 = balance
x2 = studentYes

Regression Equation Interpretation:

The Y-intercept or Coefficient b0 for this equation is -10.749496. If independent variables X1 and X2 all equal to zeros, then the log-odds Y is equal to -10.749496. Coefficient b1 has a positive sign, which means that for any increase in the value of X1, there will be a corresponding increase in the value of log-odds Y. Coefficient b2 has a negative sign, which means that for any increase in the value of X2, there will be a corresponding decrease in the value of log-odds Y. If compared between b1 and b2, Coefficient b2 is the largest in magnitude. It is almost 125 times the magnitude of b1. This means if X1 is constant, an increase in X2 by 1 will correspond to a decrease in the value of log-odds Y by 0.714878. In terms of the probability p, if x2 is 0 and x1 is increased by 1900, p is increased by e^(bo+b1(x1)+b2(x2)) / (e^(bo+b1(x1)+b2(x2))+1) = e^(-10.749496+0.005738(1900)) / (e^(-10.749496+0.005738(1900))+1) = 0.538101988620444 or 54%. However, if x2 is 1 and x1 is increased by 1900, p is increased by e^(bo+b1(x1)+b2(x2)) / (e^(bo+b1(x1)+b2(x2))+1) = e^(-10.749496+0.005738(1900)-0.714878(1) ) / (e^(-10.749496+0.005738(1900)-0.714878(1))+1) = 0.363044587441418 or 36%. Therefore, for the same increase in balance, a non-student customer is more likely to default than a student customer.