Our textbook has an example with data from a 2004 study that asked whether fictitious job candidates with distinctively Black-sounding names (like Lakisha or Jamal) were less likely to get callbacks than candidates with White-sounding names (like Emily or Greg) but otherwise identical resumes.
The data includes the following variables.

callback - Indicator for whether the employer called the applicant following submission of the application for the job.

job_city - City where the job was located: Boston or Chicago.

college_degree - Indicator for whether the resume listed a college degree.

years_experience - Number of years of experience listed on the resume.

honors - Indicator for the resume listing some sort of honors, e.g. employee of the month.

military - Indicator for whether the resume listed any military experience.

email_address - Indicator for whether the resume listed an email address for the applicant.

race - Race of the applicant, implied by the first name listed on the resume.

sex - Sex of the applicant (limited to only male and female in this study), implied by the first name listed on the resume.

resume <- read.csv('https://bclins.github.io/spring26/math222/Examples/resume.csv')
# First six rows
head(resume)
## job_city callback race sex college_degree honors years_experience military
## 1 Chicago 0 white f 1 0 6 0
## 2 Chicago 0 white f 0 0 6 1
## 3 Chicago 0 black f 1 0 6 0
## 4 Chicago 0 black f 0 0 6 0
## 5 Chicago 0 white f 0 0 22 0
## 6 Chicago 0 white m 1 1 6 0
## email_address
## 1 0
## 2 1
## 3 0
## 4 1
## 5 1
## 6 0
# Number of rows and columns
dim(resume)
## [1] 4870 9
model <- glm(callback ~ job_city + college_degree + years_experience + honors + military + email_address + race + sex, data = resume, family = 'binomial')
summary(model)
##
## Call:
## glm(formula = callback ~ job_city + college_degree + years_experience +
## honors + military + email_address + race + sex, family = "binomial",
## data = resume)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.66318 0.18196 -14.636 < 2e-16 ***
## job_cityChicago -0.44027 0.11421 -3.855 0.000116 ***
## college_degree -0.06665 0.12110 -0.550 0.582076
## years_experience 0.01998 0.01021 1.957 0.050298 .
## honors 0.76942 0.18581 4.141 3.46e-05 ***
## military -0.34217 0.21569 -1.586 0.112657
## email_address 0.21826 0.11330 1.926 0.054057 .
## racewhite 0.44241 0.10803 4.095 4.22e-05 ***
## sexm -0.18184 0.13757 -1.322 0.186260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2726.9 on 4869 degrees of freedom
## Residual deviance: 2659.2 on 4861 degrees of freedom
## AIC: 2677.2
##
## Number of Fisher Scoring iterations: 5
Just like with multiple linear regression, it is a good idea to carefully select which variables to include. Unlike linear regression, where we used \(R^2_\text{adj}\) to judge which models are better, in logistic regression we can use the Akaike information criterion (AIC) to judge whether a model balances parsimony with good predictive power. For AIC, lower numbers are better.
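As a sketch of this workflow, we can fit two candidate logistic models and compare their AIC values, or let R's step() function search for a low-AIC model automatically. The example below uses R's built-in mtcars data so that it runs on its own; the variables here are illustrative and not part of the resume study above.

```r
# Fit two candidate logistic models on a built-in dataset (mtcars).
full    <- glm(vs ~ mpg + am, data = mtcars, family = "binomial")
reduced <- glm(vs ~ mpg,      data = mtcars, family = "binomial")

# Compare AIC values; the model with the lower AIC is preferred.
AIC(full, reduced)

# step() automates the search: starting from the full model, it
# repeatedly drops (or re-adds) terms until no change lowers the AIC.
best <- step(full, trace = 0)
formula(best)
```

The same pattern applies to the resume model: fit versions with and without questionable predictors (e.g. those with large p-values in the summary above) and keep the one with the smaller AIC.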