Example: Resume data

Our textbook includes an example with data from a 2004 study that tested whether fictitious job candidates with distinctively Black-sounding names (like Lakisha or Jamal) were less likely to receive callbacks than candidates with White-sounding names (like Emily or Greg) on otherwise identical resumes.

The data set includes the variables shown below.

resume <- read.csv('https://bclins.github.io/spring26/math222/Examples/resume.csv')
# First six rows
head(resume)
##   job_city callback  race sex college_degree honors years_experience military
## 1  Chicago        0 white   f              1      0                6        0
## 2  Chicago        0 white   f              0      0                6        1
## 3  Chicago        0 black   f              1      0                6        0
## 4  Chicago        0 black   f              0      0                6        0
## 5  Chicago        0 white   f              0      0               22        0
## 6  Chicago        0 white   m              1      1                6        0
##   email_address
## 1             0
## 2             1
## 3             0
## 4             1
## 5             1
## 6             0
# Number of rows and columns
dim(resume)
## [1] 4870    9
model <- glm(callback ~ job_city + college_degree + years_experience + honors + military + email_address + race + sex, data = resume, family = 'binomial')
summary(model)
## 
## Call:
## glm(formula = callback ~ job_city + college_degree + years_experience + 
##     honors + military + email_address + race + sex, family = "binomial", 
##     data = resume)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.66318    0.18196 -14.636  < 2e-16 ***
## job_cityChicago  -0.44027    0.11421  -3.855 0.000116 ***
## college_degree   -0.06665    0.12110  -0.550 0.582076    
## years_experience  0.01998    0.01021   1.957 0.050298 .  
## honors            0.76942    0.18581   4.141 3.46e-05 ***
## military         -0.34217    0.21569  -1.586 0.112657    
## email_address     0.21826    0.11330   1.926 0.054057 .  
## racewhite         0.44241    0.10803   4.095 4.22e-05 ***
## sexm             -0.18184    0.13757  -1.322 0.186260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2726.9  on 4869  degrees of freedom
## Residual deviance: 2659.2  on 4861  degrees of freedom
## AIC: 2677.2
## 
## Number of Fisher Scoring iterations: 5
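The coefficients in the summary above are on the log-odds scale, so exponentiating an estimate gives an odds ratio. As a quick sketch, using the `racewhite` estimate printed above:

```r
# The racewhite coefficient is the change in log-odds of a callback
# for a White-sounding name versus a Black-sounding name, holding
# the other variables fixed. Exponentiating converts it to an odds ratio.
exp(0.44241)
```

This is roughly 1.56: holding the other resume characteristics fixed, the model estimates the odds of a callback are about 56% higher for White-sounding names.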

As with multiple linear regression, it is a good idea to carefully select which variables to include. Unlike linear regression, where we can use \(R^2_\text{adj}\) to judge which models are better, in logistic regression we use the Akaike information criterion (AIC), which rewards good predictive power while penalizing extra parameters, so it favors models that are both accurate and parsimonious. For AIC, lower numbers are better.
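As a minimal sketch of how AIC-based backward selection works, the example below uses a small simulated data set (so it is self-contained; the resume data itself is loaded from the URL above). R's `drop1()` reports the AIC after removing each term, and `step()` automates the process of repeatedly dropping whichever term lowers the AIC most.

```r
# Simulate a binary outcome that depends on x1 and x2 but not on noise
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 0.8 * d$x1 + 0.5 * d$x2))

full <- glm(y ~ x1 + x2 + noise, data = d, family = "binomial")

# drop1() refits the model with each single term removed and shows the AIC
drop1(full)

# step() performs backward selection: it keeps removing the term that
# lowers the AIC the most, stopping when no removal improves the AIC
reduced <- step(full, direction = "backward", trace = 0)
AIC(full)
AIC(reduced)
```

Because `step()` only removes a term when doing so lowers the AIC, the selected model's AIC is never worse than the full model's.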

  1. Use backwards selection to find a model with the lowest AIC.