Example: Beers versus BAC

df <- read.csv('http://people.hsc.edu/faculty-staff/blins/StatsExamples/bac.csv')
# First six rows
head(df)
##   student beers   bac
## 1       1     5 0.100
## 2       2     2 0.030
## 3       3     9 0.190
## 4       4     8 0.120
## 5       5     3 0.040
## 6       6     7 0.095
# Number of rows and columns
dim(df)
## [1] 16  3
# Averages
apply(df, MARGIN = 2, FUN = mean) # Applies the function mean() to the columns. MARGIN = 1 would apply to rows.
## student   beers     bac 
## 8.50000 4.81250 0.07375
# Standard deviations
apply(df, MARGIN = 2, FUN = sd)
##    student      beers        bac 
## 4.76095229 2.19753650 0.04413993

Scatterplots

It is very easy to make scatterplots in R.

plot(df$beers, df$bac, xlab = "Beers", ylab = "BAC", pch = 16)

Correlation

The cor() function calculates the correlation between two different vectors of data.

cor(df$beers, df$bac)
## [1] 0.8943381
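As a quick aside, squaring the correlation gives the coefficient of determination, which is the fraction of the variation in BAC that is explained by beers. It matches the Multiple R-squared value that summary() reports for a linear model later on.

```r
r <- cor(df$beers, df$bac)
r    # 0.8943381
r^2  # about 0.80
```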

Linear Regression

The command lm() will construct a least squares linear regression model for predicting one variable based on another. Here lm stands for linear model. The syntax is a little different from that of other commands. Here is an example of how to use it.

myLM <- lm(bac ~ beers, data = df)
myLM
## 
## Call:
## lm(formula = bac ~ beers, data = df)
## 
## Coefficients:
## (Intercept)        beers  
##    -0.01270      0.01796

Notice that the linear model contains the slope and the y-intercept for the least squares regression line. The slope is the coefficient on the beers variable, so it is the number 0.01796. The units for the slope are BAC points per beer. The slope tells us that for every 1 extra beer someone drinks, their BAC tends to increase by 0.01796, on average.
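If you want to use the slope and intercept in later calculations, the coef() function extracts them from the model as a named vector:

```r
b <- coef(myLM)  # named vector of the fitted coefficients
b["(Intercept)"] # y-intercept, about -0.0127
b["beers"]       # slope, about 0.01796
```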

Plotting Trendlines

You can add the trendline to the scatterplot with the command abline().

plot(df$beers, df$bac, xlab="Beers", ylab="BAC", pch = 16)
abline(myLM) # Adds the trendline to the scatterplot

Making Predictions

If you want to predict a y-value based on a particular x-value, use the predict() function. The arguments of predict() are a linear model and a data frame with a column whose name matches the x-variable in the linear model. In this example, the second argument is a simple data frame with two values for beers, one person who drinks 2 beers and another who drinks 8.

predict(myLM, data.frame(beers = c(2,8)))
##          1          2 
## 0.02322692 0.13100949

Based on the results, it looks like someone who drinks 2 beers will have an average BAC of about 0.023, while someone who drinks 8 beers will have an average BAC of 0.131.
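Under the hood, predict() is just plugging the x-value into the regression equation. You can verify one of the predictions by hand using the coefficients:

```r
b <- coef(myLM)
b["(Intercept)"] + b["beers"] * 2  # about 0.0232, same as predict() for 2 beers
```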

Checking Inference Conditions with Residual Plots

The function resid() takes a linear model (like myLM) and returns the residuals. You can use this to plot the residuals against the x-values, which is often useful if you want to assess whether the data really follow a linear model and whether the conditions for statistical inference are met.

plot(df$beers, resid(myLM)) 
abline(0,0) # Add line with slope = 0 and y-intercept = 0 to the residual plot

It looks like the residuals have a similar variance throughout, although it is hard to tell with a small sample size like this. But if you look at a normal Q-Q plot, you can see that the residuals do not appear to be normally distributed.

qqnorm(resid(myLM))
qqline(resid(myLM))

That might affect the reliability of any confidence intervals we make later, since the sample size is small (\(n = 16\)).
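If you do go on to make confidence intervals, the confint() function computes them directly from the model. With only \(n = 16\) observations and the Q-Q plot above in mind, these should be interpreted cautiously.

```r
confint(myLM, level = 0.95)  # 95% confidence intervals for the intercept and slope
```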

Other Information

A linear model in R contains a lot of useful information that can be accessed using the summary() function. This information will be particularly useful when we apply statistical inference techniques to linear regression.

summary(myLM)
## 
## Call:
## lm(formula = bac ~ beers, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.027118 -0.017350  0.001773  0.008623  0.041027 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.012701   0.012638  -1.005    0.332    
## beers        0.017964   0.002402   7.480 2.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02044 on 14 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7855 
## F-statistic: 55.94 on 1 and 14 DF,  p-value: 2.969e-06
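The pieces of this table can also be pulled out programmatically from the object that summary() returns, which is handy when you need a value like R-squared in a later calculation:

```r
s <- summary(myLM)
s$r.squared                          # Multiple R-squared, about 0.7998
s$sigma                              # residual standard error, about 0.02044
s$coefficients["beers", "Pr(>|t|)"]  # p-value for the slope
```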

You can get an ANOVA table for a linear model using the anova() function.

anova(myLM)
## Analysis of Variance Table
## 
## Response: bac
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## beers      1 0.0233753 0.0233753  55.944 2.969e-06 ***
## Residuals 14 0.0058497 0.0004178                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
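For a simple regression with one explanatory variable, the F-statistic in the ANOVA table is exactly the square of the t-statistic for the slope (7.480 squared is about 55.94), so the two tests give the same p-value. You can confirm this in R:

```r
tval <- summary(myLM)$coefficients["beers", "t value"]
tval^2                    # about 55.94
anova(myLM)[["F value"]][1]  # the same F-statistic from the ANOVA table
```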