Multiple Regression: Model Selection

Introduction

When we choose a model, we have two goals:

  1. Maximize the model's predictive power.
  2. Keep the model simple (parsimonious).

These might seem like conflicting goals, but sometimes simple models can have a lot of predictive power.

Reasons to remove variables

To make our models parsimonious, we remove variables. A common reason to remove a variable is that it is not a statistically significant contributor to the model; another is that it adds little predictive power beyond the variables already included.

Measuring predictive power

In multiple linear regression, we often have variables that aren't statistically significant contributors to the model. We typically want to remove these variables.

Recall that \[R^2 = \frac{SSM}{SST} \text{ and } 1 - R^2 = \frac{SSE}{SST}.\]

One way to measure the predictive power of a model is the adjusted \(R^2\), which is defined by

\[R^2_\text{adj} = 1 - \frac{MSE}{MST},\]

where \(MSE = SSE/(n - k - 1)\) and \(MST = SST/(n - 1)\) for a model with \(n\) observations and \(k\) explanatory variables. Unlike \(R^2\), the adjusted \(R^2\) can go down when you add a variable that doesn't pull its weight.
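To see that this definition agrees with what R reports, here is a minimal sketch that computes the adjusted \(R^2\) by hand and compares it to `summary()$adj.r.squared`. It uses the built-in `mtcars` data as a stand-in example, not the babies data below.

```r
# Fit a small example model on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)          # number of observations
k <- 2                     # number of explanatory variables
SSE <- sum(resid(fit)^2)   # sum of squared errors
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

MSE <- SSE / (n - k - 1)
MST <- SST / (n - 1)
R2_adj <- 1 - MSE / MST

# Compare with the value R computes.
all.equal(R2_adj, summary(fit)$adj.r.squared)
```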

Example: Babies birth weights

babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
dim(babies)
## [1] 1236    8

The babies data frame contains data on baby birth weights. The variables include bwt (the baby's birth weight), gestation (the length of the pregnancy), parity, and the mother's age, height, weight, and smoke (whether she smokes).

The linear model

Recall from last time that 62 rows out of 1236 were incomplete, so we won’t include those in the linear model.

babies_lm <- babies |>
    na.omit() |>
    lm(bwt ~ gestation + parity + age + height + weight + smoke, data = _)  
babies_lm
## 
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight + 
##     smoke, data = na.omit(babies))
## 
## Coefficients:
## (Intercept)    gestation       parity          age       height       weight  
##   -80.41085      0.44398     -3.32720     -0.00895      1.15402      0.05017  
##       smoke  
##    -8.40073



Which variables to include?

Ideally, you would try every subset of variables and see which subset produces the largest \(R^2_\text{adj}\). But with \(k\) variables there are \(2^k\) possible subsets, so that is time consuming, even on a computer, when there are a large number of variables. So our textbook suggests two alternatives: forward selection and backward elimination.
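Backward elimination can be sketched as a loop: at each step, drop whichever single variable increases \(R^2_\text{adj}\) the most, and stop when no removal helps. (Forward selection is the same idea run in the other direction, adding variables instead.) This is a minimal illustration on the built-in `mtcars` data; the starting set of predictors is just an example.

```r
# Backward elimination by adjusted R^2 on the built-in mtcars data.
current <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

repeat {
  best <- summary(current)$adj.r.squared
  best_fit <- NULL
  # Try removing each predictor in turn.
  for (v in attr(terms(current), "term.labels")) {
    candidate <- update(current, as.formula(paste(". ~ . -", v)))
    if (summary(candidate)$adj.r.squared > best) {
      best <- summary(candidate)$adj.r.squared
      best_fit <- candidate
    }
  }
  if (is.null(best_fit)) break   # no removal improves adjusted R^2
  current <- best_fit
}
formula(current)   # the selected model
```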



Practice

babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv') |>
    na.omit()
babies_lm <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = babies) 
adj.R.sq0 <- summary(babies_lm)$adj.r.squared

So the initial \(R^2_\text{adj} = 0.2541383\).

  1. Starting with the code above, which variable (if any) should you remove first to increase \(R^2_\text{adj}\) the most?
new_lm <- lm(bwt ~ gestation + parity + height + weight + smoke, data = babies) 
summary(new_lm)$adj.r.squared
## [1] 0.25477

Alternative approach

If all we care about is predictive power, then maximizing \(R^2_\text{adj}\) is a good strategy. But if we want to better understand which variables really matter, then we might want to focus on the p-values for each variable in the model instead.

summary(new_lm)
## 
## Call:
## lm(formula = bwt ~ gestation + parity + height + weight + smoke, 
##     data = babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.716 -10.150  -0.159   9.689  51.620 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -80.71321   14.04465  -5.747 1.16e-08 ***
## gestation     0.44408    0.02907  15.276  < 2e-16 ***
## parity       -3.28762    1.06281  -3.093  0.00203 ** 
## height        1.15497    0.20473   5.641 2.11e-08 ***
## weight        0.04983    0.02503   1.991  0.04672 *  
## smoke        -8.39390    0.95117  -8.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.82 on 1168 degrees of freedom
## Multiple R-squared:  0.2579, Adjusted R-squared:  0.2548 
## F-statistic:  81.2 on 5 and 1168 DF,  p-value: < 2.2e-16
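The p-value approach can also be written as a loop: repeatedly drop the predictor with the largest p-value until every remaining predictor is significant. Here is a sketch on the built-in `mtcars` data (not the babies data), with a 0.05 cutoff as an illustrative choice.

```r
# Backward elimination by p-values on the built-in mtcars data.
current <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

repeat {
  cf <- summary(current)$coefficients
  p <- cf[-1, 4]              # p-values, excluding the intercept row
  names(p) <- rownames(cf)[-1]
  if (length(p) == 0 || max(p) < 0.05) break
  worst <- names(which.max(p))              # least significant predictor
  current <- update(current, as.formula(paste(". ~ . -", worst)))
}
summary(current)$coefficients
```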



Another Example

Our textbook has a case study about eBay auctions of Mario Kart for the Nintendo Wii. The outcome variable of interest is the total price (price) of an auction, which is the highest bid plus the shipping cost. Among the other variables are cond_new (whether the game is new), stock_photo (whether the auction used a stock photo), duration (the length of the auction in days), and wheels (the number of Wii wheels included).

Forward selection

  1. Use forward selection to decide which of the variables cond_new, stock_photo, duration, or wheels to add to the model first. Then keep adding variables until you can’t increase \(R^2_\text{adj}\) any more.
mario <- read.csv("mariokart.csv")
mario_lm <- lm(price ~ cond_new + stock_photo + wheels, data = mario)
summary(mario_lm)$adj.r.squared
## [1] 0.7128315
  1. How much would you predict for the total price of the Mario Kart game if it was used, had a stock photo, included two wheels, and was put up for auction during the time period when the Mario Kart data were collected? Make a 90% prediction interval.

  2. Make a plot of the mario data frame and use it to check whether the linearity assumption appears valid.

  3. Make a plot of the mario_lm and use it to assess whether the constant variance and normal residuals assumptions hold.
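For the prediction-interval exercise, the general pattern is `predict()` with `interval = "prediction"` and a data frame of new values. Here is a sketch on the built-in `mtcars` data; `new_car` and its values are made-up inputs, not part of the Mario Kart data, but the same pattern applies to `mario_lm`.

```r
# Fit an example model on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observation: a car weighing 3000 lbs with 110 hp.
new_car <- data.frame(wt = 3.0, hp = 110)

# 90% prediction interval for a single new observation.
predict(fit, newdata = new_car, interval = "prediction", level = 0.90)
```

The result is a one-row matrix with columns fit (the point prediction), lwr, and upr (the interval endpoints).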

Checking linearity

We need to check that there is a roughly linear relationship between each of the explanatory variables and the response variable. This also lets you see how the residuals depend on each explanatory variable.

plot(mario)



Checking variance & residuals

The first plot below (residuals versus fitted values) lets us check the constant variance assumption, and the second (a normal Q-Q plot of the residuals) lets us check the normal residuals assumption.

plot(mario_lm, which = 1:2)