Specific requirements

In this programming assignment, complete the following steps to implement one of the classification techniques and classify a data set:

  • Select a data set and load it into R (use your own data, or see the R data repository: http://vincentarelbundock.github.io/Rdatasets/datasets.html).
  • Run a logistic regression model on the data and variables of your own choosing.
  • Then run the logistic regression model within the framework of cross-validation.
  • Interpret and discuss the results.
  • Submit a *.html file generated by R Markdown.

Data frame information

Name

Athletic Participation, Race, and Graduation.

Description

Six-year graduation data for 214,555 students in 2004.

Format

A data frame with 214,555 observations on the following 3 variables.

Student

Athlete or NonAthlete

Race

Black or White

Grad

1=graduated within 6 years, otherwise 0

Details

Six-year graduation data from 2004 for male non-athletes and for male athletes, where “Athlete” means football or basketball player. These data show Simpson’s Paradox.

Source

Victor Matheson, College of the Holy Cross, collected the summary statistics.

Data are derived from the summary tables in:

Matheson, V., “Research Note: Athletic Graduation Rates and Simpson’s Paradox,” Economics of Education Review, Vol. 26:4 (August 2007), 516-520.

Set knitr chunk options for integrating R code into Markdown

knitr::opts_chunk$set(echo = TRUE)

Import data

oldw <- getOption("warn")
options(warn = -1)

library(readr)
input_data <- read_csv("AthleteGrad.csv", 
    col_types = cols(
      X1 = col_skip(),
      Student = col_character(), 
      Race = col_character(), 
      Grad = col_character()
))
options(warn = oldw)

Descriptive statistics

Dimension of data frame

dim(input_data)
## [1] 214555      3

Structure of data frame

str(input_data)
## tibble [214,555 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Student: chr [1:214555] "Athlete" "Athlete" "Athlete" "Athlete" ...
##  $ Race   : chr [1:214555] "White" "White" "White" "White" ...
##  $ Grad   : chr [1:214555] "1" "1" "1" "1" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_skip(),
##   ..   Student = col_character(),
##   ..   Race = col_character(),
##   ..   Grad = col_character()
##   .. )

Summary statistics of data frame

summary(input_data)
##    Student              Race               Grad          
##  Length:214555      Length:214555      Length:214555     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Glimpse of data frame

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(input_data)
## Rows: 214,555
## Columns: 3
## $ Student <chr> "Athlete", "Athlete", "Athlete", "Athlete", "Athlete", "Ath...
## $ Race    <chr> "White", "White", "White", "White", "White", "White", "Whit...
## $ Grad    <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",...

Variable data types

sapply(input_data,mode)
##     Student        Race        Grad 
## "character" "character" "character"

Head of data frame

head(input_data)
## # A tibble: 6 x 3
##   Student Race  Grad 
##   <chr>   <chr> <chr>
## 1 Athlete White 1    
## 2 Athlete White 1    
## 3 Athlete White 1    
## 4 Athlete White 1    
## 5 Athlete White 1    
## 6 Athlete White 1

Tail of data frame

tail(input_data)
## # A tibble: 6 x 3
##   Student    Race  Grad 
##   <chr>      <chr> <chr>
## 1 NonAthlete Black 0    
## 2 NonAthlete Black 0    
## 3 NonAthlete Black 0    
## 4 NonAthlete Black 0    
## 5 NonAthlete Black 0    
## 6 NonAthlete Black 0

Mode of each variable

require(modeest)
## Loading required package: modeest
lapply(input_data,mfv)
## $Student
## [1] "NonAthlete"
## 
## $Race
## [1] "White"
## 
## $Grad
## [1] "1"

Plots

Explanation

These plots show how the observations are distributed across the levels of each variable.

Observation

It appears that there are more graduates than non-graduates, that athletes make up only a small share of the students, that there are more white graduates than black graduates, and that the majority of the students are white. Simpson’s Paradox is present in these data: in aggregate, white students at the school appear to graduate at a higher rate than black students, but the detailed breakdown tells a different story. Comparing athletic versus non-athletic students within each racial group shows that a much larger share of the black students are athletes, and athletes in general are less likely to graduate within six years. The lower aggregate graduation rate for black students is therefore driven largely by the group’s much higher athlete-to-non-athlete ratio rather than by race itself.
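
The detailed ratios behind this reading can be tabulated directly. Here is a quick sketch using dplyr (assuming input_data is still in the raw character form loaded above, so Grad is compared as the character "1"):

library(dplyr)
# Graduation rate and group size within each Race x Student cell
input_data %>%
  group_by(Race, Student) %>%
  summarise(n = n(), grad_rate = mean(Grad == "1"), .groups = "drop")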

input_data$Grad = as.factor(input_data$Grad)
input_data$Student = as.factor(input_data$Student)
input_data$Race = as.factor(input_data$Race)

tb1<-table(Grad=input_data$Grad,Athlete=input_data$Student)
tb2<-table(Grad=input_data$Grad,Race=input_data$Race)
tb3<-table(Athlete=input_data$Student,Race=input_data$Race)

barplot(tb1, beside=T, col=c("orange","blue"), main = "Complex Bar Plot")
legend(x="topleft", legend=c("Non-Graduate","Graduate"),text.col=c("orange","blue"))

barplot(tb2, beside=T, col=c("orange","blue"), main = "Complex Bar Plot")
legend(x="topleft", legend=c("Non-Graduate","Graduate"),text.col=c("orange","blue"))

barplot(tb3, beside=T, col=c("orange","blue"), main = "Complex Bar Plot")
legend(x="topleft", legend=c("Athlete","Non-Athlete"),text.col=c("orange","blue"))

plot(input_data$Student, col=c("red","green"), main = "Simple Bar Plot")

plot(input_data$Race, col=c("purple","yellow"), main = "Simple Bar Plot")

hist(ifelse(input_data$Student=="Athlete",1,0), main = "Histogram", xlab="Non-Athlete/Athlete", col="blue")

hist(ifelse(input_data$Race=="White",1,0), main = "Histogram", xlab="Black/White", col="blue")

hist(ifelse(input_data$Grad=="1",1,0), main = "Histogram", xlab="Non-Graduate/Graduate", col="blue")

Data Processing

Replacing missing values with zeros

Explanation

Missing values, if present, could cause inaccuracies or errors when calculating data limits, central tendency, dispersion, correlation, multicollinearity, p-values, z-scores, variance inflation factors, and so on.

oldw <- getOption("warn")
options(warn = -1)
library(tidyr)
library(dplyr)
input_data <- input_data %>% mutate_all(~ replace_na(., "0"))
glimpse(input_data)
## Rows: 214,555
## Columns: 3
## $ Student <fct> Athlete, Athlete, Athlete, Athlete, Athlete, Athlete, Athle...
## $ Race    <fct> White, White, White, White, White, White, White, White, Whi...
## $ Grad    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
options(warn = oldw)

Data set preparation

Explanation

To ensure that R’s modeling functions treat them correctly, all categorical dependent variables must be explicitly converted into factors. An independent variable should likewise be converted into a factor if it is categorical with more than two levels; binary categorical predictors can instead be encoded as 0/1 numerics, as is done below.

library(questionr)
input_data <- rename.variable(input_data,"Student","Athlete")
input_data$Athlete = as.numeric(with(input_data,ifelse(input_data$Athlete=="Athlete","1","0")))
input_data$Race = as.numeric(with(input_data,ifelse(input_data$Race=="White","1","0")))
input_data$Grad = as.factor(input_data$Grad)

Sorting data set

Explanation

Useful for examining the ordering of data values.

library(dplyr)
input_data <- input_data[order(input_data$Athlete,input_data$Race),] 
glimpse(input_data)
## Rows: 214,555
## Columns: 3
## $ Athlete <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Race    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Grad    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
input_data <- input_data[order(-input_data$Athlete,-input_data$Race),] 
glimpse(input_data)
## Rows: 214,555
## Columns: 3
## $ Athlete <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Race    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Grad    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

Confounding Variable Analysis

Explanation

Confounding occurs when a variable is associated with both one or more of the other independent variables and the dependent variable, which can create a misleading apparent causal relationship between the independent variable(s) and the outcome. It is important to account for such relationships because they can bias the regression model’s predictions.

Observation

The Chi-squared test results below show that every pair of variables in this data set is significantly associated: the null hypothesis of independence is easily rejected in each case, so the independent variables are potential confounders of one another.

Chi-squared Test of Independence

Explanation

This tests whether two variables are associated with each other. The null hypothesis is that variable A is not associated with variable B. Given a significance level (alpha) of 0.05, if the p-value is less than this threshold, the null hypothesis is rejected in favor of the alternative hypothesis that the two variables are indeed associated.

oldw <- getOption("warn")
options(warn = -1)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
tbl0 = table(input_data$Athlete, input_data$Grad) 
chisq.test(tbl0)$p.value
## [1] 1.63653e-05
tbl0a = table(input_data$Race, input_data$Grad) 
chisq.test(tbl0a)$p.value
## [1] 0
tbl0b = table(input_data$Athlete, input_data$Race) 
chisq.test(tbl0b)$p.value
## [1] 0
options(warn = oldw)

Binary Logistic Regression Assumptions

Linearity and Log odds

  • There must be a linear relationship between the log odds of the dependent variable and the independent variables. The relationship between the independent variables and the odds ratio or the raw outcome probability, by contrast, is not linear.

Multicollinearity

  • Independent variables should not be too highly correlated with each other.

Association with the Outcome

  • Each independent variable should show some degree of association with the dependent variable; predictors that are unrelated to the outcome add noise without explanatory value.

Variable Type

  • The dependent variable must be categorical.
  • The independent variables could be either continuous or categorical.

Outcome Value

  • The outcome values of the dependent variable must be dichotomous.

Independent Observations

  • The observations must be independent of each other. The observations should not come from repeated measurements or matched data. If observations are related to one another, the model will tend to overweight the significance of those observations. A quick programmatic check of the verifiable assumptions appears after this list.
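
A few of these assumptions can be verified programmatically. The sketch below (a minimal check, assuming the prepared input_data from above, with Athlete and Race as 0/1 numerics and Grad as a factor) tests the dichotomous-outcome and multicollinearity assumptions; the formal variance inflation factor check appears further below.

# Dichotomous outcome: the dependent variable should have exactly two levels
nlevels(input_data$Grad) == 2
# Multicollinearity: a pairwise correlation far from +/-1 suggests the
# predictors are not too highly correlated
cor(input_data$Athlete, input_data$Race)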

Automatic Stepwise Iterations

Explanation

Starting from the null (intercept-only) model and bounded by the all-inclusive model, these stepwise iterations search for the logistic regression model with the lowest AIC, which is taken here as the most parsimonious model.

Observation

It took 3 stepwise iterations to find the most parsimonious logistic regression model for this data frame.

oldw <- getOption("warn")
options(warn = -1)
m_empty = glm(Grad~1, data=input_data, family=binomial(link="logit"))
m_all = glm(Grad~., data=input_data, family=binomial(link="logit"))
step(m_empty, direction = "both", scope=formula(m_all))
## Start:  AIC=292896.1
## Grad ~ 1
## 
##           Df Deviance    AIC
## + Race     1   287795 287799
## + Athlete  1   292875 292879
## <none>         292894 292896
## 
## Step:  AIC=287799.2
## Grad ~ Race
## 
##           Df Deviance    AIC
## + Athlete  1   287715 287721
## <none>         287795 287799
## - Race     1   292894 292896
## 
## Step:  AIC=287720.6
## Grad ~ Race + Athlete
## 
##           Df Deviance    AIC
## <none>         287715 287721
## - Athlete  1   287795 287799
## - Race     1   292875 292879
## 
## Call:  glm(formula = Grad ~ Race + Athlete, family = binomial(link = "logit"), 
##     data = input_data)
## 
## Coefficients:
## (Intercept)         Race      Athlete  
##     -0.6027       1.0063       0.3080  
## 
## Degrees of Freedom: 214554 Total (i.e. Null);  214552 Residual
## Null Deviance:       292900 
## Residual Deviance: 287700    AIC: 287700
options(warn = oldw)

Manual Stepwise Iteration

Explanation

This is the manual iteration of the stepwise-selected regression model. It reveals model statistics such as the McFadden pseudo R-squared, the Akaike information criterion (AIC), and the p-values of the logistic regression. The McFadden R-squared compares the log-likelihood of the fitted model with that of the null model; values closer to 1 indicate a better fit. The AIC is a penalized measure of fit derived from the maximized likelihood; among models fitted to the same data, the smaller the AIC, the better the model. The p-value of each independent variable indicates the significance of that variable’s association with the dependent variable, holding the other variables constant.

Observation

The two confounding variables ‘Athlete’ and ‘Race’ could not be removed during the model iterations because both are the control variables; in other words, these two independent variables are the main drivers of the dependent variable. Because of the confounding in the data set, the McFadden R-squared computed below is very low (about 0.018), meaning the model explains only a small share of the variation in graduation. However, the other model statistics, namely the model p-value of effectively 0 and variance inflation factors close to 1, indicate that this iteration is both highly significant and stable.

oldw <- getOption("warn")
options(warn = -1)
lg_out_2 <- glm(Grad ~ Athlete+Race, data=input_data, family=binomial(link="logit"))
summary(lg_out_2) 
## 
## Call:
## glm(formula = Grad ~ Athlete + Race, family = binomial(link = "logit"), 
##     data = input_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.491  -1.353   1.012   1.012   1.442  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.60267    0.01364 -44.183   <2e-16 ***
## Athlete      0.30802    0.03440   8.953   <2e-16 ***
## Race         1.00628    0.01436  70.061   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292894  on 214554  degrees of freedom
## Residual deviance: 287715  on 214552  degrees of freedom
## AIC: 287721
## 
## Number of Fisher Scoring iterations: 4
# Coefficient Confidence Intervals
confint.default(lg_out_2)
##                  2.5 %     97.5 %
## (Intercept) -0.6294006 -0.5759315
## Athlete      0.2405927  0.3754505
## Race         0.9781335  1.0344351
# Odds Ratios
exp(coef(lg_out_2))
## (Intercept)     Athlete        Race 
##   0.5473504   1.3607304   2.7354181
options(warn = oldw)

McFadden R-Squared

library(DescTools)
r2 <- PseudoR2(lg_out_2, which = "McFadden")
r2_adj <- PseudoR2(lg_out_2, which = "McFaddenAdj")
print(r2)
##  McFadden 
## 0.0176838
print(r2_adj)
## McFaddenAdj 
##  0.01766331

Model P-Value

# log-likelihood of the null model
ll.null <- lg_out_2$null.deviance/-2
# log-likelihood of the fitted model
ll.proposed <- lg_out_2$deviance/-2
pv <- 1-pchisq(2*(ll.proposed-ll.null),df=(length(lg_out_2$coefficients)-1))
print(pv)
## [1] 0

Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
## 
##     Recode
## The following object is masked from 'package:dplyr':
## 
##     recode
# Variance Inflation Factors (values above 5 suggest problematic multicollinearity)
vif(lg_out_2)
##  Athlete     Race 
## 1.039315 1.039315

Cross-Validation

Explanation

The purpose of cross-validation is to ensure that every record of the data set is used for both training and testing the model, which yields a more reliable estimate of out-of-sample performance than a single train/test split. The most parsimonious logistic regression iteration identified by the Automatic Stepwise Iteration above will be used for this analysis.

Observation

With 10 repeats of 20 folds, the resulting cross-validated model has an observed accuracy of 0.6048, a sensitivity of 0.9278, and a specificity of 0.1719. The p-value of less than 2.2e-16 shows that this accuracy is significantly better than the no-information rate, so the model is statistically significant even though its practical performance is modest. An accuracy of 0.6048 means the model classifies about 60% of all observations (true positives plus true negatives) correctly. A sensitivity of 0.9278 means the model identifies about 93% of the true positives (graduates). A specificity of 0.1719 means the model identifies only about 17% of the true negatives (non-graduates), so it is heavily biased toward predicting graduation.
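
As a sanity check on these definitions, the reported rates can be reproduced by hand from the pooled confusion-matrix counts shown at the end of this section (the four cell counts below are copied from that output):

# Pooled counts from the confusion matrix below: TP/FN are actual graduates,
# TN/FP are actual non-graduates
TP <- 1139940; FN <- 88650
TN <- 157670;  FP <- 759290
TP / (TP + FN)                    # sensitivity: 0.9278
TN / (TN + FP)                    # specificity: 0.1719
(TP + TN) / (TP + TN + FP + FN)   # accuracy: 0.6048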

oldw <- getOption("warn")
options(warn = -1)

library(FSelector)
library(rpart)
library(rpart.plot)
library(data.tree)
## 
## Attaching package: 'data.tree'
## The following object is masked from 'package:DescTools':
## 
##     Sort
library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE
library(dplyr)
library(doSNOW)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: snow
str(input_data)
## tibble [214,555 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Athlete: num [1:214555] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Race   : num [1:214555] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Grad   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
summary(input_data$Grad)
##      0      1 
##  91696 122859
# Create a stratified index of 200 folds (10 repeats of 20 folds)
set.seed(123)
rand_folds <- createMultiFolds(input_data$Grad, k=20, times=10)
summary(rand_folds)
##              Length Class  Mode   
## Fold01.Rep01 203828 -none- numeric
## Fold02.Rep01 203827 -none- numeric
## Fold03.Rep01 203827 -none- numeric
## ... (195 further folds, each holding roughly 203,827 training indices) ...
## Fold19.Rep10 203827 -none- numeric
## Fold20.Rep10 203829 -none- numeric
# Set up caret's trainControl object
set.seed(123)
trainCtrl <- trainControl(method = "repeatedcv", number=20, repeats=10, savePredictions=T, index=rand_folds)
summary(trainCtrl)
##                   Length Class  Mode     
## method              1    -none- character
## number              1    -none- numeric  
## repeats             1    -none- numeric  
## search              1    -none- character
## p                   1    -none- numeric  
## initialWindow       0    -none- NULL     
## horizon             1    -none- numeric  
## fixedWindow         1    -none- logical  
## skip                1    -none- numeric  
## verboseIter         1    -none- logical  
## returnData          1    -none- logical  
## returnResamp        1    -none- character
## savePredictions     1    -none- logical  
## classProbs          1    -none- logical  
## summaryFunction     1    -none- function 
## selectionFunction   1    -none- character
## preProcOptions      6    -none- list     
## sampling            0    -none- NULL     
## index             200    -none- list     
## indexOut            0    -none- NULL     
## indexFinal          0    -none- NULL     
## timingSamps         1    -none- numeric  
## predictionBounds    2    -none- logical  
## seeds               1    -none- logical  
## adaptive            4    -none- list     
## trim                1    -none- logical  
## allowParallel       1    -none- logical
# Set up doSNOW package for multi-core training. This will speed up the training process.
c1 <- makeCluster(2,type="SOCK")
registerDoSNOW(c1)

# Set seed for reproducibility and train model
set.seed(123)
train_obj <- train(Grad ~ Athlete+Race, data=input_data, method="glm", family="binomial", trControl=trainCtrl)
train_obj
## Generalized Linear Model 
## 
## 214555 samples
##      2 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold, repeated 10 times) 
## Summary of sample sizes: 203828, 203827, 203827, 203827, 203827, 203827, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6047913  0.1099939
summary(train_obj)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.491  -1.353   1.012   1.012   1.442  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.60267    0.01364 -44.183   <2e-16 ***
## Athlete      0.30802    0.03440   8.953   <2e-16 ***
## Race         1.00628    0.01436  70.061   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292894  on 214554  degrees of freedom
## Residual deviance: 287715  on 214552  degrees of freedom
## AIC: 287721
## 
## Number of Fisher Scoring iterations: 4
# Shutdown cluster
stopCluster(c1)

# Confusion Matrix: Comparison between the predicted values and actual observed values
confusionMatrix( table( Predicted=(train_obj$pred)$pred,Actual=(train_obj$pred)$obs ), positive="1" )
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted       0       1
##         0  157670   88650
##         1  759290 1139940
##                                           
##                Accuracy : 0.6048          
##                  95% CI : (0.6041, 0.6054)
##     No Information Rate : 0.5726          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.11            
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9278          
##             Specificity : 0.1719          
##          Pos Pred Value : 0.6002          
##          Neg Pred Value : 0.6401          
##              Prevalence : 0.5726          
##          Detection Rate : 0.5313          
##    Detection Prevalence : 0.8852          
##       Balanced Accuracy : 0.5499          
##                                           
##        'Positive' Class : 1               
## 
options(warn = oldw)

Final Regression Equation

Explanation

This is the equation that results from the most parsimonious regression iteration.

Observation

With two confounding variables present and no other significant independent variables available, this is the most parsimonious regression iteration identified.

Y = ln(p/(1-p)) = b0 + b1*X1 + b2*X2
Y = -0.60267 + 0.30802*X1 + 1.00628*X2
p = P(Grad = 1) = e^(b0 + b1*X1 + b2*X2) / (1 + e^(b0 + b1*X1 + b2*X2))
b0 = Y-intercept = -0.60267
b1 = Coefficient of Athlete = 0.30802
b2 = Coefficient of Race = 1.00628
X1 = Athlete (1 = athlete, 0 = non-athlete)
X2 = Race (1 = White, 0 = Black)

Regression Equation Interpretation

The y-intercept or coefficient b0 for this equation is -0.60267: if the independent variables X1 and X2 are both equal to zero, the log-odds Y equals -0.60267. Coefficients b1 and b2 both have positive signs, which means that any increase in the value of either X1 or X2 produces a corresponding increase in the log-odds Y. Equivalently, the odds of graduating multiply by e^0.30802 ≈ 1.36 for athletes and by e^1.00628 ≈ 2.74 for white students, matching the odds ratios printed earlier. Coefficient b2 is about three times the magnitude of coefficient b1.

In terms of the probability p, the baseline case X1=0 and X2=0 gives p = e^(-0.60267) / (1 + e^(-0.60267)) = 0.35. If X1=1 and X2=0, then p = e^(-0.60267+0.30802) / (1 + e^(-0.60267+0.30802)) = 0.43; this is still below 0.5, so an athletic black student is classified as not graduating within six years, although athletic status raises the probability relative to the baseline. If X1=0 and X2=1, then p = e^(-0.60267+1.00628) / (1 + e^(-0.60267+1.00628)) = 0.60. Hence, under this model, being an athlete or being white each increases the predicted chance of graduation. If both X1=1 and X2=1, then p = e^(-0.60267+0.30802+1.00628) / (1 + e^(-0.60267+0.30802+1.00628)) = 0.67, the highest predicted probability of graduation. However, when examining the actual observed data points, the athletic white students have a lower chance of graduating than the non-athletic white students, so it appears that the sign of coefficient b1 should be negative instead. Perhaps another classification model, such as a decision tree, would fit this data better.
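
These hand calculations can also be reproduced directly from the fitted model. A short sketch (assuming the lg_out_2 object fitted in the Manual Stepwise Iteration above is still in memory):

# Predicted graduation probabilities for the four Athlete x Race combinations
newdat <- data.frame(Athlete = c(0, 1, 0, 1), Race = c(0, 0, 1, 1))
cbind(newdat, p_grad = predict(lg_out_2, newdata = newdat, type = "response"))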