In multilinear regression, we assume that the average y-value is an
(affine) linear function of \(p\)
explanatory variables \(x_1, \ldots,
x_p\): \[\mu_y = \beta_0 + \beta_1 x_1
+ \ldots + \beta_p x_p.\] We can use the lm()
command in R to estimate the coefficients in the model.
The model we get will have the form \[\hat{y} = b_0 + b_1 x_1 + \ldots + b_p x_p.\]
If there are two explanatory variables \(x_1\) and \(x_2\), then you get a regression plane instead of a line.
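As a quick sketch of the syntax (using a small made-up data frame, not the race data below), lm() with a formula like y ~ x1 + x2 estimates all of the coefficients at once:

```r
# Synthetic data where y is exactly 1 + 2*x1 + 3*x2,
# so lm() should recover those coefficients.
df <- data.frame(x1 = c(0, 1, 2, 3, 4, 5),
                 x2 = c(1, 0, 2, 1, 3, 2))
df$y <- 1 + 2 * df$x1 + 3 * df$x2

# Fit the multilinear model mu_y = b0 + b1*x1 + b2*x2.
fit <- lm(y ~ x1 + x2, data = df)
coef(fit)  # (Intercept) = 1, x1 = 2, x2 = 3 (up to rounding)
```

Because the synthetic y-values fall exactly on a plane, the estimated coefficients match the true ones; with real data they only approximate them.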
results <- read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring19/math222/Examples/highbridge2018.csv")
head(results)
## place bib gender age state time minutes
## 1 1 66 M 35 VA 1:10:28 70.47
## 2 2 87 M 29 VA 1:18:08 78.13
## 3 3 112 F 32 VA 1:25:47 85.78
## 4 4 116 M 32 VA 1:27:02 87.03
## 5 5 32 M 38 VA 1:27:14 87.23
## 6 6 115 F 31 VA 1:28:15 88.25
##
## Call:
## lm(formula = minutes ~ age + gender, data = results)
##
## Coefficients:
## (Intercept) age genderM
## 105.2479 0.9703 -21.0004
What does plot(linear_model) do? You can plot a linear model in R. You get several images that help
you check the assumptions for linear regression inference (normal
residuals and constant variance). The par() function lets
you organize the plots into rows and columns.
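For instance (a minimal sketch with made-up data; the model name linear_model is just a placeholder):

```r
# Fit a small model to made-up data, then draw the diagnostic plots.
df <- data.frame(x = 1:10,
                 y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2))
linear_model <- lm(y ~ x, data = df)

par(mfrow = c(2, 2))   # arrange the plots in 2 rows and 2 columns
plot(linear_model)     # residual and normal Q-Q diagnostic plots
```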
What factors explain differences in teaching evaluation scores? Here
is data from 463 courses at the University of Texas at Austin that can be
found in the R library moderndive (from the online textbook
ModernDive).
## # A tibble: 6 × 4
## ID score age gender
## <int> <dbl> <int> <fct>
## 1 1 4.7 36 female
## 2 2 4.1 36 female
## 3 3 3.9 36 female
## 4 4 4.8 36 female
## 5 5 4.6 59 male
## 6 6 4.3 59 male
In a linear model without an interaction term, the coefficient on the age
variable does not depend on gender, so the lines for
predicting teaching scores for men and women are parallel.
##
## Call:
## lm(formula = score ~ age + gender, data = evals_data)
##
## Coefficients:
## (Intercept) age gendermale
## 4.484116 -0.008678 0.190571
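A sketch with simulated data (not the evals data) shows why the lines are parallel: without an interaction term, the predicted male-female gap is the gendermale coefficient at every age.

```r
set.seed(1)
df <- data.frame(age = rep(30:49, 2),
                 gender = factor(rep(c("female", "male"), each = 20)))
df$score <- 4.5 - 0.01 * df$age + 0.2 * (df$gender == "male") + rnorm(40, sd = 0.1)

fit <- lm(score ~ age + gender, data = df)

# The gap between the two fitted lines is the same no matter
# which age we plug in: it is exactly the gendermale coefficient.
gap_at_35 <- predict(fit, data.frame(age = 35, gender = "male")) -
             predict(fit, data.frame(age = 35, gender = "female"))
gap_at_60 <- predict(fit, data.frame(age = 60, gender = "male")) -
             predict(fit, data.frame(age = 60, gender = "female"))
```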
A better linear model might have different slopes for men versus women. You can get models like that by adding an interaction term: \[\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \underbrace{\beta_{12} x_1 x_2}_{\text{interaction term}}.\]
The two trendlines are not parallel when you include interaction effects. Female instructors appear to have a steeper age penalty on course evaluations than male instructors do.
Here’s how you make a linear model with interactions in R. Notice
that you use * instead of + in the
lm() function.
##
## Call:
## lm(formula = score ~ age * gender, data = evals_data)
##
## Coefficients:
## (Intercept) age gendermale age:gendermale
## 4.88299 -0.01752 -0.44604 0.01353
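To see what the * does under the hood, here is a sketch with a tiny made-up data frame: the design matrix picks up an extra age:gendermale column, which is what allows the two slopes to differ.

```r
df <- data.frame(score = c(4.7, 4.1, 4.6, 4.3),
                 age = c(36, 40, 59, 62),
                 gender = factor(c("female", "female", "male", "male")))

fit <- lm(score ~ age * gender, data = df)

# The design matrix has both main effects plus the interaction column.
colnames(model.matrix(fit))
# "(Intercept)" "age" "gendermale" "age:gendermale"
```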
babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
dim(babies)
## [1] 1236 8
The babies data frame contains data on baby birth
weights. Its variables include the birth weight bwt (in ounces) together with gestation (in days), parity, and the mother's age, height (in inches), weight (in lbs.), and smoking status smoke.
There are a lot of cells with NA (not available) entries, and these
could mess up our analysis below. The na.omit() command is
a fast way to remove these.
## [1] 1174 8
Whenever you omit data, you should make sure that you aren’t omitting a large percentage of the sample, and you might also want to check the way that the data was collected to make sure that individuals with missing data are not systematically different from other individuals in the sample. In this example, we are omitting 62 rows of data (out of 1236). That’s only about 5% of the data, so we probably aren’t affecting our results too much.
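Here is a minimal sketch of how na.omit() behaves on a made-up data frame:

```r
# A tiny data frame with some missing entries.
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))

complete <- na.omit(df)   # drop every row that has an NA anywhere
nrow(complete)            # 2 rows survive (rows 1 and 4)
```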
##
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight +
## smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation parity age height weight
## -80.41085 0.44398 -3.32720 -0.00895 1.15402 0.05017
## smoke
## -8.40073
##
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight +
## smoke, data = babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.613 -10.189 -0.135 9.683 51.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.41085 14.34657 -5.605 2.60e-08 ***
## gestation 0.44398 0.02910 15.258 < 2e-16 ***
## parity -3.32720 1.12895 -2.947 0.00327 **
## age -0.00895 0.08582 -0.104 0.91696
## height 1.15402 0.20502 5.629 2.27e-08 ***
## weight 0.05017 0.02524 1.987 0.04711 *
## smoke -8.40073 0.95382 -8.807 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.83 on 1167 degrees of freedom
## Multiple R-squared: 0.258, Adjusted R-squared: 0.2541
## F-statistic: 67.61 on 6 and 1167 DF, p-value: < 2.2e-16
You can use the select() function from the
dplyr library to select the variables you want to include
in a data frame or tibble.
## bwt gestation parity smoke
## 1 120 284 0 0
## 2 113 282 0 0
## 3 128 279 0 1
## 5 108 282 0 1
## 6 136 286 0 0
## 7 138 244 0 0
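A small sketch (with a made-up data frame standing in for babies): the dplyr call is shown in a comment, together with the base-R equivalent, in case dplyr is not installed.

```r
df <- data.frame(bwt = c(120, 113), gestation = c(284, 282),
                 parity = c(0, 0), smoke = c(0, 1), age = c(27, 33))

# With dplyr:  kept <- select(df, bwt, gestation, parity, smoke)
# The base-R equivalent keeps the same columns, by name:
kept <- df[, c("bwt", "gestation", "parity", "smoke")]
colnames(kept)
```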
In R, you can also use the pipe operator (|>) to
chain functions together in an organized way.
##
## Call:
## lm(formula = bwt ~ gestation + parity + smoke, data = select(babies,
## bwt, gestation, parity, smoke))
##
## Coefficients:
## (Intercept) gestation parity smoke
## -4.3435 0.4584 -3.2673 -8.3884
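A minimal illustration of the pipe (available in R 4.1 and later): each function's output becomes the first argument of the next function.

```r
# These two lines compute the same thing:
sum(sqrt(c(1, 4, 9)))                    # nested calls, read inside-out
total <- c(1, 4, 9) |> sqrt() |> sum()   # piped, read left-to-right
total  # 6
```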
You can make an ANOVA table for multilinear regression, but it doesn’t contain much information that wasn’t in the summary of the linear model.
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## gestation 1 65450 65450 261.2078 < 2.2e-16 ***
## parity 1 2345 2345 9.3578 0.002271 **
## age 1 216 216 0.8627 0.353179
## height 1 12518 12518 49.9594 2.695e-12 ***
## weight 1 1683 1683 6.7187 0.009660 **
## smoke 1 19437 19437 77.5713 < 2.2e-16 ***
## Residuals 1167 292409 251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We need to check that there is a roughly linear relationship between each of the explanatory variables and the response variable. This also lets you see how the residuals depend on each explanatory variable.
par(mfrow=c(2, 3))
plot(babies$gestation, babies$bwt, xlab='Gestation time (in days)', ylab='Birth weight (in ounces)')
plot(babies$age, babies$bwt, xlab="Mother's age", ylab='Birth weight (in ounces)')
plot(babies$height, babies$bwt, xlab="Mother's height (inches)", ylab='Birth weight (in ounces)')
plot(babies$weight, babies$bwt, xlab="Mother's weight (lbs.)", ylab='Birth weight (in ounces)')
plot(babies$parity, babies$bwt, xlab="Parity", ylab='Birth weight (in ounces)')
plot(babies$smoke, babies$bwt, xlab="Smoker", ylab='Birth weight (in ounces)')
What does plot(data_frame) do? If you plot a data frame, then you get a matrix of pairwise plots between every pair of variables in the data frame. That can help check that there is a linear relationship between the variables.
We also need to check the residuals to see that they are roughly normally distributed with constant variance. This is harder to do with so many variables. Here are two of the most important things to check:
par(mfrow=c(1,2))
qqnorm(resid(babies_lm))
qqline(resid(babies_lm))
plot(fitted(babies_lm),resid(babies_lm),xlab='Predicted values',ylab='Residuals',main='Residuals vs. Predicted Values')
abline(0,0)
You could also get these plots automatically (with some extra junk)
if you plot() the linear model.
Confidence intervals for the coefficients and prediction intervals for new observations work exactly the same as in the single variable case.
## 2.5 % 97.5 %
## (Intercept) -1.085588e+02 -52.26290070
## gestation 3.868886e-01 0.50106812
## parity -5.542197e+00 -1.11220272
## age -1.773287e-01 0.15942813
## height 7.517744e-01 1.55626636
## weight 6.411309e-04 0.09968892
## smoke -1.027213e+01 -6.52933831
predict(babies_lm, data.frame(gestation = 240,height=70,weight=120,age=25,parity=1,smoke=0),interval='prediction')
## fit lwr upr
## 1 109.3942 78.07624 140.7122
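A self-contained sketch of both commands on simulated data (the names df and fit here are placeholders, not the babies model):

```r
set.seed(42)
df <- data.frame(x = 1:20)
df$y <- 3 + 2 * df$x + rnorm(20)

fit <- lm(y ~ x, data = df)

# 95% confidence intervals for the coefficients:
ci <- confint(fit)

# A 95% prediction interval for a new observation with x = 10:
pi <- predict(fit, data.frame(x = 10), interval = "prediction")
```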
The asterisk operator in lm() represents both the
individual “main effects” and their interaction. For example, a formula such as y ~ x1 * x2 corresponds to \(\mu_y = \beta_0 + \underbrace{\beta_1 x_1 + \beta_2 x_2}_\text{main effects} + \underbrace{\beta_{12} x_1 x_2}_\text{interaction term}.\)
With more than two variables, you can include several interactions. For example, a formula such as y ~ x1*x2 + x1*x3 + x2*x3
would include both the main effects and also the interaction terms for all three pairs of variables in the model: \[\mu_y = \beta_0 + \underbrace{\beta_{1} x_1 + \beta_2 x_2 + \beta_3 x_3}_\text{main effects} + \underbrace{\beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3}_\text{interaction terms}.\]
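As a sketch with made-up variable names, you can check in R that listing the pairwise products and using the ^2 shorthand produce the same design matrix:

```r
set.seed(1)
df <- data.frame(y = rnorm(10), x1 = rnorm(10),
                 x2 = rnorm(10), x3 = rnorm(10))

m1 <- lm(y ~ x1*x2 + x1*x3 + x2*x3, data = df)
m2 <- lm(y ~ (x1 + x2 + x3)^2, data = df)   # shorthand for the same terms

# Both models have the three main effects and all three pairwise interactions.
colnames(model.matrix(m1))
```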
Including interaction terms between two quantitative variables makes the model nonlinear in the explanatory variables. That isn’t necessarily a bad thing, but it adds complexity.
A group of students wanted to investigate which factors influence the price of a house (Koester, Davis, and Ross, 2003). They used http://www.househunt.com, limiting their search to single family homes in California. They collected a stratified sample by stratifying on three different regions in CA (northern, southern, and central), and then randomly selecting a sample from within each stratum. They decided to focus on homes that were less than 5000 square feet and sold for less than $1.5 million.
houseData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/housing.txt")
head(houseData)
## sqft price City bedrooms baths
## 1 3392 339000 Dublin 3 2.1
## 2 4100 899900 pleasanton 4 3.0
## 3 3200 448641 Clayton 5 4.0
## 4 1436 239999 Moraga 4 3.0
## 5 1944 377500 Antioch 3 2.0
## 6 1500 299900 Danville 3 2.5
Make a linear model to predict housing prices based on square feet, bathrooms, and bedrooms.
Make a confidence interval for the average price of a 2,000 sq. foot house with 2.5 baths, and 4 bedrooms.
How significant are the explanatory variables in the model?