Example: Resume data

Our textbook includes an example with data from a 2004 study that tested whether fictitious job candidates with distinctively Black-sounding names (like Lakisha or Jamal) were less likely to receive callbacks than candidates with White-sounding names (like Emily or Greg) on otherwise identical resumes.

The data set includes the variables shown below.

resume <- read.csv('https://bclins.github.io/spring26/math222/Examples/resume.csv')
# First six rows
head(resume)
##   job_city callback  race sex college_degree honors years_experience military
## 1  Chicago        0 white   f              1      0                6        0
## 2  Chicago        0 white   f              0      0                6        1
## 3  Chicago        0 black   f              1      0                6        0
## 4  Chicago        0 black   f              0      0                6        0
## 5  Chicago        0 white   f              0      0               22        0
## 6  Chicago        0 white   m              1      1                6        0
##   email_address
## 1             0
## 2             1
## 3             0
## 4             1
## 5             1
## 6             0
# Number of rows and columns
dim(resume)
## [1] 4870    9
model <- glm(callback ~ job_city + college_degree + years_experience + honors + military + email_address + race + sex, data = resume, family = 'binomial')
summary(model)
## 
## Call:
## glm(formula = callback ~ job_city + college_degree + years_experience + 
##     honors + military + email_address + race + sex, family = "binomial", 
##     data = resume)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.66318    0.18196 -14.636  < 2e-16 ***
## job_cityChicago  -0.44027    0.11421  -3.855 0.000116 ***
## college_degree   -0.06665    0.12110  -0.550 0.582076    
## years_experience  0.01998    0.01021   1.957 0.050298 .  
## honors            0.76942    0.18581   4.141 3.46e-05 ***
## military         -0.34217    0.21569  -1.586 0.112657    
## email_address     0.21826    0.11330   1.926 0.054057 .  
## racewhite         0.44241    0.10803   4.095 4.22e-05 ***
## sexm             -0.18184    0.13757  -1.322 0.186260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2726.9  on 4869  degrees of freedom
## Residual deviance: 2659.2  on 4861  degrees of freedom
## AIC: 2677.2
## 
## Number of Fisher Scoring iterations: 5
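The coefficients in the summary above are on the log-odds scale, so exponentiating an estimate gives an odds ratio. As a quick sketch, using the `racewhite` estimate printed above:

```r
# The racewhite coefficient is the change in log-odds of a callback
# for a White-sounding name versus a Black-sounding name, holding
# the other variables fixed. Exponentiating converts it to an odds ratio.
exp(0.44241)
```

This is roughly 1.56: holding the other resume characteristics fixed, the model estimates the odds of a callback are about 56% higher for White-sounding names.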

As with multiple linear regression, it is a good idea to carefully select which variables to include. Unlike linear regression, where we can use \(R^2_\text{adj}\) to judge which models are better, in logistic regression we use the Akaike information criterion (AIC), which rewards good predictive power while penalizing extra parameters, so it favors models that are both accurate and parsimonious. For AIC, lower numbers are better.
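As a minimal sketch of how AIC-based backward selection works, the example below uses a small simulated data set (so it is self-contained; the resume data itself is loaded from the URL above). R's `drop1()` reports the AIC after removing each term, and `step()` automates the process of repeatedly dropping whichever term lowers the AIC most.

```r
# Simulate a binary outcome that depends on x1 and x2 but not on noise
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 0.8 * d$x1 + 0.5 * d$x2))

full <- glm(y ~ x1 + x2 + noise, data = d, family = "binomial")

# drop1() refits the model with each single term removed and shows the AIC
drop1(full)

# step() performs backward selection: it keeps removing the term that
# lowers the AIC the most, stopping when no removal improves the AIC
reduced <- step(full, direction = "backward", trace = 0)
AIC(full)
AIC(reduced)
```

Because `step()` only removes a term when doing so lowers the AIC, the selected model's AIC is never worse than the full model's.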

  1. Use backwards selection to find a model with the lowest AIC.