Multiple Regression

Introduction

In multiple linear regression, we assume that the average y-value is an (affine) linear function of \(p\) explanatory variables \(x_1, \ldots, x_p\): \[\mu_y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p.\] We can use the lm() command in R to estimate the coefficients in the model.

The model we get will have the form \[\hat{y} = b_0 + b_1 x_1 + \ldots + b_p x_p.\]
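As a minimal sketch (using made-up data, not any data set from these notes), here is how lm() estimates the coefficients when there are two explanatory variables:

```r
# Simulate data where the true model is y = 2 + 3*x1 - 1.5*x2 + noise
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y  <- 2 + 3*x1 - 1.5*x2 + rnorm(100, sd = 0.1)

# Fit the multiple regression model and look at the estimates
fit <- lm(y ~ x1 + x2)
coef(fit)  # b0, b1, b2 should be close to 2, 3, -1.5
```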

Regression (hyper)planes

If there are two explanatory variables \(x_1\) and \(x_2\), then you get a regression plane instead of a line.

Figure: A regression plane (Source: Stanford Stats 202 Book)

Example: 2018 High Bridge half-marathon

results <- read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring19/math222/Examples/highbridge2018.csv")
head(results)
##   place bib gender age state    time minutes
## 1     1  66      M  35    VA 1:10:28   70.47
## 2     2  87      M  29    VA 1:18:08   78.13
## 3     3 112      F  32    VA 1:25:47   85.78
## 4     4 116      M  32    VA 1:27:02   87.03
## 5     5  32      M  38    VA 1:27:14   87.23
## 6     6 115      F  31    VA 1:28:15   88.25
myLM <- lm(minutes ~ age + gender, data = results)
myLM
## 
## Call:
## lm(formula = minutes ~ age + gender, data = results)
## 
## Coefficients:
## (Intercept)          age      genderM  
##    105.2479       0.9703     -21.0004
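In this output, genderM is an indicator variable that equals 1 for male runners and 0 for female runners. As a sketch, a prediction for a hypothetical 40-year-old male runner can be computed by hand from the coefficients, or with predict():

```r
# Predicted finish time for a 40-year-old male, computed by hand:
105.2479 + 0.9703 * 40 - 21.0004   # about 123 minutes

# The same prediction using predict():
predict(myLM, data.frame(age = 40, gender = "M"))
```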

What does plot(linear_model) do?

You can plot a linear model in R. You get several diagnostic plots that help you check the assumptions for linear regression inference (normally distributed residuals with constant variance). The par() function lets you arrange the plots into rows and columns.

par(mfrow=c(2,2))
plot(myLM)

Example: Course evaluations

What factors explain differences in teaching evaluation scores? Here is data from 463 courses at the University of Texas at Austin that can be found in the R package moderndive (from the online textbook ModernDive).

library(moderndive)
library(dplyr)  # for the select() function
evals_data <- select(evals, ID, score, age, gender)
head(evals_data)
## # A tibble: 6 × 4
##      ID score   age gender
##   <int> <dbl> <int> <fct> 
## 1     1   4.7    36 female
## 2     2   4.1    36 female
## 3     3   3.9    36 female
## 4     4   4.8    36 female
## 5     5   4.6    59 male  
## 6     6   4.3    59 male

A simple linear model

In a simple linear model, the coefficient on the age variable does not depend on gender, so the lines for predicting teaching scores are parallel.

simple_lm <- lm(score ~ age + gender, data = evals_data)
simple_lm
## 
## Call:
## lm(formula = score ~ age + gender, data = evals_data)
## 
## Coefficients:
## (Intercept)          age   gendermale  
##    4.484116    -0.008678     0.190571
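One way to see the parallel lines (a sketch, using hypothetical ages): at any age, the male and female predictions differ by the same constant amount.

```r
# Predicted scores for males minus predicted scores for females
# at several ages; every difference is the same constant gap.
new_data <- data.frame(age = c(30, 50, 70), gender = "male")
predict(simple_lm, new_data) -
  predict(simple_lm, transform(new_data, gender = "female"))
# each difference equals the gendermale coefficient (about 0.19)
```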



Linear models with interaction

A better linear model might have different slopes for men versus women. You can get models like that by adding an interaction term: \[\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \underbrace{\beta_{12} x_1 x_2}_{\text{interaction term}}.\]

The two trendlines are not parallel when you include interaction effects. Women appear to have a steeper age penalty on course evaluations than male instructors do.



Interaction model

Here’s how you make a linear model with interactions in R. Notice that you use * instead of + in the lm() function.

interaction_lm <- lm(score ~ age * gender, data = evals_data)
interaction_lm
## 
## Call:
## lm(formula = score ~ age * gender, data = evals_data)
## 
## Coefficients:
##    (Intercept)             age      gendermale  age:gendermale  
##        4.88299        -0.01752        -0.44604         0.01353
  1. Based on the output above, what is the slope for male instructors? What is the slope for female instructors?
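One way to check your answer in R is to pull the slopes straight out of the coefficient vector:

```r
b <- coef(interaction_lm)
b["age"]                        # slope for female instructors (about -0.0175)
b["age"] + b["age:gendermale"]  # slope for male instructors (about -0.0040)
```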



Example: Babies birth weights

babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
dim(babies)
## [1] 1236    8

The babies data frame contains data on baby birth weights (bwt, in ounces) along with several explanatory variables, including gestation time (in days), parity, and the mother's age, height (in inches), weight (in lbs.), and smoking status.

Cleaning up the data

There are a lot of cells with NA (not available) entries, and these could mess up our analysis below. The na.omit() command is a fast way to remove every row that contains one.

babies <- na.omit(babies)
dim(babies)
## [1] 1174    8

Whenever you omit data, you should make sure that you aren’t omitting a large percentage of the sample, and you might also want to check the way that the data was collected to make sure that individuals with missing data are not systematically different from other individuals in the sample. In this example, we are omitting 62 rows of data (out of 1236). That’s only about 5% of the data, so we probably aren’t affecting our results too much.
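If you want to see where the missing values are before dropping them, you can count the NA entries in each column (run this before the na.omit() call above):

```r
# Count the NA entries in each column of the data frame:
colSums(is.na(babies))

# Count how many rows have no missing values at all:
sum(complete.cases(babies))
```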

The linear model

babies_lm <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = babies)
babies_lm
## 
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight + 
##     smoke, data = babies)
## 
## Coefficients:
## (Intercept)    gestation       parity          age       height       weight  
##   -80.41085      0.44398     -3.32720     -0.00895      1.15402      0.05017  
##       smoke  
##    -8.40073

Not every variable is significant

summary(babies_lm)
## 
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight + 
##     smoke, data = babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.613 -10.189  -0.135   9.683  51.713 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -80.41085   14.34657  -5.605 2.60e-08 ***
## gestation     0.44398    0.02910  15.258  < 2e-16 ***
## parity       -3.32720    1.12895  -2.947  0.00327 ** 
## age          -0.00895    0.08582  -0.104  0.91696    
## height        1.15402    0.20502   5.629 2.27e-08 ***
## weight        0.05017    0.02524   1.987  0.04711 *  
## smoke        -8.40073    0.95382  -8.807  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.83 on 1167 degrees of freedom
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.2541 
## F-statistic: 67.61 on 6 and 1167 DF,  p-value: < 2.2e-16
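Since the age coefficient is nowhere near significant, one reasonable next step is to refit the model without it. The update() function refits a model with a modified formula:

```r
# Refit babies_lm with the age variable dropped from the formula:
smaller_lm <- update(babies_lm, . ~ . - age)
coef(smaller_lm)
```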

Selecting variables

You can use the select() function from the dplyr library to select the variables you want to include in a data frame or tibble.

head(select(babies, bwt, gestation, parity, smoke))
##   bwt gestation parity smoke
## 1 120       284      0     0
## 2 113       282      0     0
## 3 128       279      0     1
## 5 108       282      0     1
## 6 136       286      0     0
## 7 138       244      0     0

In R, you can also use the pipe operator (|>) to chain functions together in an organized way.

babies |>
    select(bwt, gestation, parity, smoke) |>
    lm(bwt ~ gestation + parity + smoke, data = _) 
## 
## Call:
## lm(formula = bwt ~ gestation + parity + smoke, data = select(babies, 
##     bwt, gestation, parity, smoke))
## 
## Coefficients:
## (Intercept)    gestation       parity        smoke  
##     -4.3435       0.4584      -3.2673      -8.3884



ANOVA with multiple linear regression

You can make an ANOVA table for multiple linear regression, but it doesn’t contain much information that wasn’t in the summary of the linear model. Note that the sums of squares in this table are sequential: each row measures the variability explained by that variable after accounting for the variables above it, so the order of the variables in the model matters.

anova(babies_lm)
## Analysis of Variance Table
## 
## Response: bwt
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## gestation    1  65450   65450 261.2078 < 2.2e-16 ***
## parity       1   2345    2345   9.3578  0.002271 ** 
## age          1    216     216   0.8627  0.353179    
## height       1  12518   12518  49.9594 2.695e-12 ***
## weight       1   1683    1683   6.7187  0.009660 ** 
## smoke        1  19437   19437  77.5713 < 2.2e-16 ***
## Residuals 1167 292409     251                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
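One thing anova() can do that summary() cannot is compare two nested models with a partial F-test. For example, to test whether age and weight together add anything beyond the other predictors:

```r
# Fit the smaller (nested) model without age and weight:
reduced_lm <- lm(bwt ~ gestation + parity + height + smoke, data = babies)

# Partial F-test comparing the nested model to the full model:
anova(reduced_lm, babies_lm)
```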

Checking linearity

We need to check that there is a roughly linear relationship between each of the explanatory variables and the response variable. This also lets you see how the residuals depend on each explanatory variable.

par(mfrow=c(2, 3))
plot(babies$gestation, babies$bwt, xlab='Gestation time (in days)', ylab='Birth weight (in ounces)')
plot(babies$age, babies$bwt, xlab="Mother's age", ylab='Birth weight (in ounces)')
plot(babies$height, babies$bwt, xlab="Mother's height (inches)", ylab='Birth weight (in ounces)')
plot(babies$weight, babies$bwt, xlab="Mother's weight (lbs.)", ylab='Birth weight (in ounces)')
plot(babies$parity, babies$bwt, xlab="Parity", ylab='Birth weight (in ounces)')
plot(babies$smoke, babies$bwt, xlab="Smoker", ylab='Birth weight (in ounces)')



What does plot(data_frame) do?

If you plot a data frame, then you get a matrix of pairwise plots between every pair of variables in the data frame. That can help check that there is a linear relationship between the variables.

plot(babies)

Checking the residuals

We also need to check the residuals to see that they are roughly normally distributed with constant variance. This is harder to do with so many variables. Here are two of the most important things to check:

par(mfrow=c(1,2))
qqnorm(resid(babies_lm))
qqline(resid(babies_lm))
plot(fitted(babies_lm),resid(babies_lm),xlab='Predicted values',ylab='Residuals',main='Residuals vs. Predicted Values')
abline(0,0)

You could also get these plots automatically (with some extra junk) if you plot() the linear model.



Confidence & prediction intervals

These work exactly the same as in the single-variable case: confint() gives confidence intervals for the coefficients, and predict() gives confidence or prediction intervals for new observations.

confint(babies_lm)
##                     2.5 %       97.5 %
## (Intercept) -1.085588e+02 -52.26290070
## gestation    3.868886e-01   0.50106812
## parity      -5.542197e+00  -1.11220272
## age         -1.773287e-01   0.15942813
## height       7.517744e-01   1.55626636
## weight       6.411309e-04   0.09968892
## smoke       -1.027213e+01  -6.52933831
predict(babies_lm, data.frame(gestation = 240,height=70,weight=120,age=25,parity=1,smoke=0),interval='prediction')
##        fit      lwr      upr
## 1 109.3942 78.07624 140.7122
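Changing interval = 'prediction' to interval = 'confidence' gives the (narrower) confidence interval for the mean birth weight of all babies with these characteristics, rather than the range for an individual baby:

```r
predict(babies_lm,
        data.frame(gestation = 240, height = 70, weight = 120,
                   age = 25, parity = 1, smoke = 0),
        interval = 'confidence')
```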

Interaction with more than two variables

The asterisk operator in lm() represents both the individual “main effects” and their interaction. For example:

lm(y ~ x1 * x2, data = df)

corresponds to \(\mu_y = \beta_0 + \underbrace{\beta_1 x_1 + \beta_2 x_2}_\text{main effects} + \underbrace{\beta_{12} x_1 x_2}_\text{interaction term}.\)

With more than two variables, you can include several interactions. For example,

lm(y ~ x1*x2 + x1*x3 + x2*x3, data = df)

would include both the main effects and also the interaction terms for all three pairs of variables in the model: \[\mu_y = \beta_0 + \underbrace{\beta_{1} x_1 + \beta_2 x_2 + \beta_3 x_3}_\text{main effects} + \underbrace{\beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3}_\text{interaction terms}.\]
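R's formula language has shorthand for this. As a sketch (y, x1, x2, x3, and df are placeholders, as above), the following fits the same model; note also that a colon (as in x1:x2) denotes an interaction term by itself, without the main effects:

```r
# (x1 + x2 + x3)^2 expands to all three main effects
# plus all three pairwise interaction terms:
lm(y ~ (x1 + x2 + x3)^2, data = df)
```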

Including interaction terms between two quantitative variables makes the model nonlinear in the explanatory variables. That isn’t necessarily a bad thing, but it adds complexity.



Practice

A group of students wanted to investigate which factors influence the price of a house (Koester, Davis, and Ross, 2003). They used http://www.househunt.com, limiting their search to single family homes in California. They collected a stratified sample by stratifying on three different regions in CA (northern, southern, and central), and then randomly selecting a sample from within each stratum. They decided to focus on homes that were less than 5000 square feet and sold for less than $1.5 million.

houseData <- read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/housing.txt")
head(houseData)
##   sqft  price       City bedrooms baths
## 1 3392 339000     Dublin        3   2.1
## 2 4100 899900 pleasanton        4   3.0
## 3 3200 448641    Clayton        5   4.0
## 4 1436 239999     Moraga        4   3.0
## 5 1944 377500    Antioch        3   2.0
## 6 1500 299900   Danville        3   2.5
  1. Make a linear model to predict housing prices based on square feet, bathrooms, and bedrooms.

  2. Make a confidence interval for the average price of a 2,000 sq. foot house with 2.5 baths, and 4 bedrooms.

  3. How significant are the explanatory variables in the model?