df <- read.csv('http://people.hsc.edu/faculty-staff/blins/StatsExamples/bac.csv')
# First six rows
head(df)
## student beers bac
## 1 1 5 0.100
## 2 2 2 0.030
## 3 3 9 0.190
## 4 4 8 0.120
## 5 5 3 0.040
## 6 6 7 0.095
# Number of rows and columns
dim(df)
## [1] 16 3
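If you only want one of the two dimensions, nrow() and ncol() return them separately. Here is a quick sketch using a small made-up data frame (not the bac data):

```r
# nrow() and ncol() give the two dimensions of dim() separately.
toy <- data.frame(a = 1:16, b = rnorm(16), c = rnorm(16))
dim(toy)   # 16 3
nrow(toy)  # 16
ncol(toy)  # 3
```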
# Averages
apply(df, MARGIN = 2, FUN = mean) # Applies the function mean() to the columns. MARGIN = 1 would apply to rows.
## student beers bac
## 8.50000 4.81250 0.07375
# Standard deviations
apply(df, MARGIN = 2, FUN = sd)
## student beers bac
## 4.76095229 2.19753650 0.04413993
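A closely related command is sapply(), which loops over the columns of a data frame directly. Unlike apply(), it does not first coerce the data frame to a matrix, which matters if some columns are not numeric. A small sketch with a made-up data frame:

```r
# sapply() applies a function to each column of a data frame,
# without converting the data frame to a matrix first.
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 60))
sapply(toy, mean)  # same result as apply(toy, MARGIN = 2, FUN = mean)
sapply(toy, sd)
```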
It is very easy to make scatterplots in R.
plot(df$beers, df$bac, xlab = "Beers", ylab = "BAC", pch = 16)
The cor() function calculates the correlation between
two different vectors of data.
cor(df$beers, df$bac)
## [1] 0.8943381
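The square of the correlation is the coefficient of determination, which is the same number reported as "Multiple R-squared" in the summary() output later on. You can check this directly from the value above:

```r
# Squaring the correlation gives R-squared.
# The value of r is copied from the cor() output above.
r <- 0.8943381
round(r^2, 4)  # 0.7998
```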
The command lm() will construct a least squares linear
regression model for predicting one variable based on another. Here
lm stands for linear model. The formula syntax is
a little different from other commands. Here is an example of how to use
it.
myLM <- lm(bac ~ beers, data = df)
myLM
##
## Call:
## lm(formula = bac ~ beers, data = df)
##
## Coefficients:
## (Intercept) beers
## -0.01270 0.01796
Notice that the linear model contains the slope and the y-intercept
for the least squares regression line. The slope is the coefficient on
the beers variable, so it is the number 0.01796. The units
for the slope are BAC points per beer. The slope tells us that for
every 1 extra beer someone drinks, their BAC tends to increase by
0.01796, on average.
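If you want to use the slope and intercept in later calculations, the coef() function extracts them from a fitted model as a named vector. A sketch with a small made-up data set (not the bac data):

```r
# coef() returns the fitted coefficients as a named vector.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.0, 8.1))
fit <- lm(y ~ x, data = toy)
coef(fit)        # named vector: (Intercept) and x
coef(fit)[["x"]] # just the slope (2.01 for this toy data)
```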
You can add the trendline to the scatterplot with the command
abline().
plot(df$beers, df$bac, xlab="Beers", ylab="BAC", pch = 16)
abline(myLM) # Adds the trendline to the scatterplot
If you want to predict a y-value based on a particular x-value, use
the predict() function. The arguments of
predict() are a linear model and a data frame with a header
that has the same name as the x-variable in the linear model. In this
example, the second argument is a simple data frame with two values for
beers, one person who drinks 2 beers and another who drinks
8.
predict(myLM, data.frame(beers = c(2,8)))
## 1 2
## 0.02322692 0.13100949
Based on the results, it looks like someone who drinks 2 beers will have an average BAC of about 0.023, while someone who drinks 8 beers will have an average BAC of 0.131.
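As a sanity check, you can reproduce these predictions by plugging the x-values into the regression equation by hand, using the (rounded) coefficients from the model output above:

```r
# Manual check of predict(): y-hat = intercept + slope * x.
# Coefficient values are copied (rounded) from the lm() output above.
b0 <- -0.012701
b1 <- 0.017964
b0 + b1 * c(2, 8)  # roughly 0.0232 and 0.1310, matching predict()
```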
The function resid() takes a linear model (like
myLM) and returns the residuals. You can use this to plot
the residuals against the x-values, which is often useful if you want to
assess whether the data really follow a linear model and whether the
conditions for statistical inference are met.
plot(df$beers, resid(myLM))
abline(0,0) # Add line with slope = 0 and y-intercept = 0 to the residual plot
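A residual is just the observed y-value minus the fitted y-value, so resid() always agrees with computing the difference yourself via fitted(). A quick sketch with a small made-up data set:

```r
# resid(fit) is the same as observed minus fitted values.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4.1, 5.9, 8))
fit <- lm(y ~ x, data = toy)
all.equal(resid(fit), toy$y - fitted(fit))  # TRUE
```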
It looks like the residuals have a similar variance throughout, although it is hard to tell with a small sample size like this. But if you look at a normal Q-Q plot, you can see that the residuals are not normally distributed.
qqnorm(resid(myLM))
qqline(resid(myLM))
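If you want a formal check to go along with the Q-Q plot, the Shapiro-Wilk test (shapiro.test() in base R) tests the null hypothesis that a sample came from a normal distribution. Sketched here with made-up skewed values rather than the actual residuals:

```r
# Shapiro-Wilk normality test; a small p-value suggests non-normality.
# The data here are made-up skewed values, not the model residuals.
set.seed(1)
e <- rexp(16) - 1   # right-skewed sample
shapiro.test(e)     # prints the test statistic W and a p-value
```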
That might affect the reliability of any confidence intervals we make later, since the sample size is small (\(n = 16\)).
A linear model in R contains a lot of useful information that can be
accessed using the summary() function. This information
will be particularly useful when we apply statistical inference
techniques to linear regression.
summary(myLM)
##
## Call:
## lm(formula = bac ~ beers, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027118 -0.017350 0.001773 0.008623 0.041027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.012701 0.012638 -1.005 0.332
## beers 0.017964 0.002402 7.480 2.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02044 on 14 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7855
## F-statistic: 55.94 on 1 and 14 DF, p-value: 2.969e-06
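The estimates and standard errors in this table are exactly what you need for confidence intervals, and the confint() function computes them for you. A sketch with a small made-up data set (not the bac data):

```r
# confint() turns a model's estimates and standard errors
# into confidence intervals for the intercept and slope.
toy <- data.frame(x = 1:5, y = c(1.9, 4.2, 5.8, 8.1, 9.9))
fit <- lm(y ~ x, data = toy)
confint(fit, level = 0.95)  # 95% CI for each coefficient
```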
You can get an ANOVA table for a linear model using the
anova() function.
anova(myLM)
## Analysis of Variance Table
##
## Response: bac
## Df Sum Sq Mean Sq F value Pr(>F)
## beers 1 0.0233753 0.0233753 55.944 2.969e-06 ***
## Residuals 14 0.0058497 0.0004178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
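The F value in the ANOVA table is just the ratio of the two mean squares, which you can verify from the rounded values printed above:

```r
# F = (mean square for the model) / (mean square for the residuals).
# Values copied from the ANOVA table above; the small discrepancy
# from 55.944 is due to rounding in the printed mean squares.
0.0233753 / 0.0004178  # approximately 55.94
```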