Not surprisingly, there is a strong correlation between a student’s grades on different midterm exams. Here are the data from my statistics sections from 2013 to 2015.
exams <- read.csv('http://people.hsc.edu/faculty-staff/blins/StatsExamples/midtermRegressionS13_F15.csv')
# First six rows
head(exams)
## Midterm1 Midterm2
## 1 52 54
## 2 70 73
## 3 86 89
## 4 73 62
## 5 92 85
## 6 47 72
# Number of rows and columns
dim(exams)
## [1] 180 2
# Midterm averages
# Applies mean() to each column; MARGIN = 1 would apply it to the rows instead.
apply(exams, MARGIN = 2, FUN = mean)
## Midterm1 Midterm2
## 76.77778 74.16667
# Standard deviations
apply(exams, MARGIN = 2, FUN = sd)
## Midterm1 Midterm2
## 14.51855 13.81017
It is very easy to make scatterplots in R.
plot(exams$Midterm1, exams$Midterm2, xlab = "Midterm 1 grade", ylab = "Midterm 2 grade", pch = 16)
The cor() function calculates the correlation between two different vectors of data.
cor(exams$Midterm1, exams$Midterm2)
## [1] 0.6094617
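Under the hood, the correlation is the average product of the two columns’ z-scores, dividing by n − 1. Here is a quick check of that formula using the base R function scale(), which standardizes a vector:

```r
# Correlation as the average product of z-scores (dividing by n - 1)
zx <- scale(exams$Midterm1)        # standardized midterm 1 grades
zy <- scale(exams$Midterm2)        # standardized midterm 2 grades
sum(zx * zy) / (nrow(exams) - 1)   # same value as cor() above
```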
The command lm() will construct a least squares linear regression model for predicting one variable from another; lm stands for linear model. The syntax is a little different from other commands. Here is an example of how to use it.
myLM <- lm(Midterm2 ~ Midterm1, data = exams)
myLM
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = exams)
##
## Coefficients:
## (Intercept) Midterm1
## 29.6567 0.5797
Notice that the linear model contains the slope and the y-intercept of the least squares regression line. The slope is the coefficient on the Midterm1 grade, 0.5797. This means that, on average, a student who scores 1 point higher on midterm 1 tends to score 0.5797 points higher on midterm 2.
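These coefficients can be checked against the textbook formulas: the slope is b1 = r·sy/sx and the intercept is b0 = ȳ − b1·x̄. Recomputing them from the correlation and the summary statistics computed earlier should reproduce the output of lm():

```r
# Recompute the least squares coefficients from summary statistics:
# slope b1 = r * sy / sx, intercept b0 = ybar - b1 * xbar
r  <- cor(exams$Midterm1, exams$Midterm2)
b1 <- r * sd(exams$Midterm2) / sd(exams$Midterm1)
b0 <- mean(exams$Midterm2) - b1 * mean(exams$Midterm1)
c(intercept = b0, slope = b1)   # matches coef(myLM)
```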
You can add the trendline to the scatterplot with the command abline().
plot(exams$Midterm1, exams$Midterm2, xlab="Midterm 1 grade", ylab="Midterm 2 grade", pch = 16)
abline(myLM) # Adds the trendline to the scatterplot
If you want to predict a y-value for a particular x-value, use the predict() function. The arguments of predict() are a linear model and a data frame with a column whose name matches the x-variable in the linear model. In this example, the second argument is a simple data frame with two values for Midterm 1: one student who gets a 50 and another who gets a 100.
predict(myLM, data.frame(Midterm1 = c(50,100)))
## 1 2
## 58.64292 87.62917
Based on the results, students who get a 50 on midterm 1 have an average grade of about 58.6 on midterm 2. On the other hand, students with a perfect score on midterm 1 only average about 87.6 on midterm 2. This phenomenon is known as regression to the mean: the predicted outcomes of a linear regression model tend to be less extreme than the inputs.
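One way to see regression to the mean is in standardized units: the least squares prediction satisfies z̑y = r·zx, and since |r| < 1, the predicted z-score is always closer to 0 than the input z-score. A quick sketch, plugging in the means, standard deviations, and correlation from the output above:

```r
# Regression to the mean in standardized units: predicted z_y = r * z_x.
# The numeric values below come from the output earlier in this document.
r  <- 0.6094617                     # correlation between the two midterms
zx <- (100 - 76.77778) / 14.51855   # z-score of a perfect midterm 1 grade
zy <- r * zx                        # predicted midterm 2 z-score (closer to 0)
74.16667 + zy * 13.81017            # back to points: about 87.6, as predict() gave
```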
The function resid() takes a linear model (like myLM) and returns the residuals. You can use this to plot the residuals against the x-values, which is often useful if you want to assess whether the data really follow a linear model and whether the conditions for statistical inference are met.
plot(exams$Midterm1, resid(myLM))
abline(0,0) # Add line with slope = 0 and y-intercept = 0 to the residual plot
It looks like the residuals have a similar variance throughout. But if you look at a normal quantile (Q-Q) plot, you can see that the residuals are not normally distributed.
qqnorm(resid(myLM))
qqline(resid(myLM))
That’s probably not a big issue since the sample size is large (\(n = 180\)).
A linear model in R contains a lot of useful information that can be accessed using the summary() function. This information will be particularly useful when we apply statistical inference techniques to linear regression.
summary(myLM)
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = exams)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.672 -6.336 1.096 7.918 26.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.65667 4.41632 6.715 2.43e-10 ***
## Midterm1 0.57972 0.05652 10.256 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.98 on 178 degrees of freedom
## Multiple R-squared: 0.3714, Adjusted R-squared: 0.3679
## F-statistic: 105.2 on 1 and 178 DF, p-value: < 2.2e-16
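In simple linear regression, the Multiple R-squared reported here is just the square of the correlation computed earlier, and the residual standard error can be recovered directly from the residuals. A quick check, using myLM and exams from above:

```r
# R-squared in simple regression is the squared correlation
cor(exams$Midterm1, exams$Midterm2)^2          # about 0.3714

# Residual standard error: sqrt(SSE / residual degrees of freedom)
sqrt(sum(resid(myLM)^2) / df.residual(myLM))   # about 10.98
```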
You can get an ANOVA table for a linear model using the anova() function.
anova(myLM)
## Analysis of Variance Table
##
## Response: Midterm2
## Df Sum Sq Mean Sq F value Pr(>F)
## Midterm1 1 12681 12680.7 105.19 < 2.2e-16 ***
## Residuals 178 21458 120.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
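The entries in the ANOVA table fit together: each mean square is its sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. In simple regression, F also equals the square of the slope’s t statistic from summary(). Plugging in the values printed above:

```r
# F statistic is the regression mean square over the residual mean square
12680.7 / 120.6   # about 105.2, matching the F value in the table

# In simple regression, F equals the square of the slope's t statistic
10.256^2          # about 105.2 as well
```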