When we choose a model, we have two goals:

- Good predictive power.
- Parsimony, i.e., a model that is as simple as possible.

These seem like conflicting goals, but sometimes simple models can have a lot of predictive power.
To make our models parsimonious, we remove variables. Here are reasons to remove variables:
Overfitting - including lots of variables can lead to overfitting, i.e., modeling the noise from the sample, not the underlying trend.
Insignificant variables - some variables might not have a statistically significant effect on the response, so it usually doesn’t make sense to include them in the model.
Multicollinearity - sometimes two input variables are highly correlated, and including both in the model is redundant.
In multilinear regression, we often have variables that aren’t statistically significant contributors to the model. We typically want to remove these variables.
Recall that \[R^2 = \frac{SSM}{SST} \text{ and } 1 - R^2 = \frac{SSE}{SST}.\]
One way to measure the predictive power of a model is to use the adjusted \(R^2\) which is defined by
\[R^2_\text{adj} = 1 - \frac{MSE}{MST}.\]
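To see how this formula connects to the sums of squares above, here is a sketch that computes \(R^2_\text{adj}\) by hand and checks it against R's built-in value. It uses the built-in mtcars data so it is self-contained; the variable names are just for illustration.

```r
# Compute adjusted R^2 by hand: R^2_adj = 1 - MSE/MST.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)          # number of observations
p <- 2                     # number of explanatory variables
SSE <- sum(resid(fit)^2)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
MSE <- SSE / (n - p - 1)   # mean squared error
MST <- SST / (n - 1)       # mean squared total
adj_r2 <- 1 - MSE / MST
adj_r2
summary(fit)$adj.r.squared # should match the hand computation
```

Unlike \(R^2\), the adjusted version penalizes extra variables through the degrees-of-freedom corrections in MSE and MST, so adding a useless variable can make it go down.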
babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
dim(babies)
## [1] 1236    8
The babies data frame contains data on baby birth weights. The response variable is bwt (birth weight), and the explanatory variables used below are gestation, parity, age, height, weight, and smoke.
Recall from last time that 62 rows out of 1236 were incomplete, so we won’t include those in the linear model.
babies_lm <- babies |>
na.omit() |>
lm(bwt ~ gestation + parity + age + height + weight + smoke, data = _)
babies_lm
##
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight +
## smoke, data = na.omit(babies))
##
## Coefficients:
## (Intercept) gestation parity age height weight
## -80.41085 0.44398 -3.32720 -0.00895 1.15402 0.05017
## smoke
## -8.40073
Ideally, you would try every subset of variables and see which subset produces the largest \(R^2_\text{adj}\). That is time consuming, even on a computer, when there are a large number of variables. So our textbook suggests two alternatives: forward selection and backward elimination.
babies <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv') |>
na.omit()
babies_lm <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = babies)
adj.R.sq0 = summary(babies_lm)$adj.r.squared
So the initial \(R^2_\text{adj} = 0.2541383\).
new_lm <- lm(bwt ~ gestation + parity + height + weight + smoke, data = babies)
summary(new_lm)$adj.r.squared
## [1] 0.25477
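A backward-elimination step like the one above (drop each remaining variable in turn and compare \(R^2_\text{adj}\)) can be automated. This is a sketch using the built-in mtcars data so it runs anywhere; the variable names are illustrative, but the same loop applies to babies_lm.

```r
# One backward-elimination step: refit the model with each variable
# removed in turn and record the adjusted R^2 of every candidate model.
full <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
vars <- c("wt", "hp", "disp", "drat")
adj_r2 <- sapply(vars, function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  summary(reduced)$adj.r.squared
})
adj_r2                       # adjusted R^2 after dropping each variable
summary(full)$adj.r.squared  # compare with the full model
```

If the best of these candidates beats the full model's \(R^2_\text{adj}\), drop that variable and repeat; otherwise stop.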
If all we care about is predictive power, then maximizing \(R^2_\text{adj}\) is a good strategy. But if we want to better understand which variables really matter, then we might want to focus on the p-values for each variable in the model instead.
##
## Call:
## lm(formula = bwt ~ gestation + parity + height + weight + smoke,
## data = babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.716 -10.150 -0.159 9.689 51.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.71321 14.04465 -5.747 1.16e-08 ***
## gestation 0.44408 0.02907 15.276 < 2e-16 ***
## parity -3.28762 1.06281 -3.093 0.00203 **
## height 1.15497 0.20473 5.641 2.11e-08 ***
## weight 0.04983 0.02503 1.991 0.04672 *
## smoke -8.39390 0.95117 -8.825 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.82 on 1168 degrees of freedom
## Multiple R-squared: 0.2579, Adjusted R-squared: 0.2548
## F-statistic: 81.2 on 5 and 1168 DF, p-value: < 2.2e-16
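If you prefer the p-value approach, the per-variable p-values shown in the summary table can also be extracted programmatically. This sketch (again on mtcars, so it is self-contained) pulls them out of the coefficient matrix; with backward elimination you would remove the variable with the largest p-value, provided it exceeds your significance threshold.

```r
# Extract per-variable p-values from a fitted model so the least
# significant variable can be identified programmatically.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
sort(pvals, decreasing = TRUE)  # largest p-value = first removal candidate
```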
Our textbook has a case study about eBay auctions of Mario Kart for
the Nintendo Wii. The outcome variable of interest is the total price
(price) of an auction, which is the highest bid plus the
shipping cost. Among the other variables are the following:
cond_new: Indicator variable for if the game is new (1) or used (0).

stock_photo: Indicator variable for if the auction's main photo is a stock photo.

duration: The length of the auction, in days, taking values from 1 to 10.

wheels: The number of Wii wheels included with the auction. A Wii wheel is an optional steering wheel accessory that holds the Wii controller.

Use forward selection: decide which of cond_new, stock_photo, duration, or wheels to add to the model first. Then keep adding variables until you can't increase \(R^2_\text{adj}\) any more.

mario <- read.csv("mariokart.csv")
mario_lm <- lm(price ~ cond_new + stock_photo + wheels, data = mario)
summary(mario_lm)$adj.r.squared
## [1] 0.7128315
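The first step of forward selection can be sketched as a loop: fit a one-variable model for each candidate and keep the one with the largest \(R^2_\text{adj}\). The example below uses the built-in mtcars data so it runs without the mariokart.csv file; the same pattern applies with price as the response and cond_new, stock_photo, duration, and wheels as candidates.

```r
# First forward-selection step: fit a one-variable model per candidate
# and pick the candidate with the largest adjusted R^2.
candidates <- c("wt", "hp", "disp", "drat")
adj_r2 <- sapply(candidates, function(v) {
  fit <- lm(as.formula(paste("mpg ~", v)), data = mtcars)
  summary(fit)$adj.r.squared
})
adj_r2
names(which.max(adj_r2))  # variable to add to the model first
```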
How much would you predict for the total price of a Mario Kart game that was used, used a stock photo, included two wheels, and was put up for auction during the time period when the Mario Kart data were collected? Make a 90% prediction interval.
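In R, a prediction interval comes from predict() with interval = "prediction". This sketch uses mtcars so it is self-contained; for the exercise you would instead pass a data frame with cond_new = 0, stock_photo = 1, and wheels = 2 to predict() on mario_lm.

```r
# A 90% prediction interval for a new observation using predict().
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- data.frame(wt = 3.0, hp = 120)  # hypothetical new observation
pi90 <- predict(fit, newdata = new_obs,
                interval = "prediction", level = 0.90)
pi90  # columns: fit (point prediction), lwr, upr
```

Note that a prediction interval is wider than a confidence interval for the mean response, because it accounts for the variability of an individual observation.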
Make a plot of the mario data frame and use it to check whether the linearity assumption appears valid.
Make a plot of mario_lm and use it to assess whether the constant variance and normal residuals assumptions hold.
We need to check that there is a roughly linear relationship between each of the explanatory variables and the response variable. This also lets you see how the residuals depend on each explanatory variable.
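These checks can all be done with base R graphics. The sketch below uses mtcars so it runs as-is; for the exercise, substitute the mario data frame and mario_lm.

```r
# Regression diagnostics in base R.
fit <- lm(mpg ~ wt + hp, data = mtcars)
pairs(mtcars[, c("mpg", "wt", "hp")])  # roughly linear relationships?
plot(fit, which = 1)                   # residuals vs fitted: constant variance
plot(fit, which = 2)                   # normal Q-Q plot: normal residuals
plot(resid(fit) ~ mtcars$wt)           # residuals vs one explanatory variable
```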