results = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring19/math222/Examples/highbridge2018.csv")
head(results)
##   place bib gender age state    time minutes
## 1     1  66      M  35    VA 1:10:28   70.47
## 2     2  87      M  29    VA 1:18:08   78.13
## 3     3 112      F  32    VA 1:25:47   85.78
## 4     4 116      M  32    VA 1:27:02   87.03
## 5     5  32      M  38    VA 1:27:14   87.23
## 6     6 115      F  31    VA 1:28:15   88.25
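The variables in the results data frame are place, bib, gender, age, state, time (the finishing time as h:mm:ss), and minutes (the finishing time in minutes).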
How many people ran the High Bridge half-marathon in 2018? Use
the nrow() function.
Make a histogram of the runners’ times using the
hist() function. What is the shape of the distribution?
Would you say it is skewed left, skewed right, or symmetric? Based on
the shape of the distribution, which would you expect to be larger: the
mean or the median?
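To see how the shape of a histogram relates to the mean and the median, here is a minimal sketch using made-up finishing times. The vector toy_minutes is hypothetical and is not taken from the race results.

```r
# Made-up finishing times in minutes (hypothetical, not the race data).
toy_minutes <- c(72, 75, 78, 80, 81, 83, 85, 88, 95, 110, 140)

# Draw the histogram; most values are bunched on the left with a long right tail.
hist(toy_minutes)

# Compare the two measures of center.
mean(toy_minutes)
median(toy_minutes)
```

In this made-up sample the few very slow times pull the mean above the median. Whether the same thing happens in the race data is for you to check.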
Use the summary() function to find the mean and the
five number summary of the race times. Is your prediction about the mean
vs. the median from the last problem correct?
To analyze a categorical variable like the state where the
runners are from, you can make a table using the table()
function. Do this. How many runners are from each state?
To make a bar graph to visualize how many runners are from each
state, you can pass the output of the table() function to the
barplot() function. Try this.
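As a rough illustration of how table() and barplot() fit together, here is a sketch on a made-up vector of states. The name toy_state and its values are hypothetical.

```r
# Made-up home states for eight runners (hypothetical data).
toy_state <- c("VA", "VA", "NC", "VA", "MD", "NC", "VA", "VA")

# table() counts how many times each value occurs.
state_counts <- table(toy_state)
state_counts

# barplot() draws one bar per category, with height equal to the count.
barplot(state_counts)
```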
Use the function class() to determine the data type
that R is using for each of the variables in the data frame above. How
many different data types are there in this data frame?
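The sketch below shows one way to check data types, using a small made-up data frame called toy (a hypothetical name, not part of the lab data). It uses class() on single columns and sapply() to check every column at once.

```r
# A small made-up data frame (hypothetical, not the race data).
toy <- data.frame(
  age = c(35L, 29L, 32L),
  state = c("VA", "NC", "VA"),
  minutes = c(70.47, 78.13, 85.78),
  stringsAsFactors = FALSE
)

# class() reports the type of a single column.
class(toy$age)      # "integer"
class(toy$minutes)  # "numeric"

# sapply() applies class() to every column at once.
sapply(toy, class)
```

Note that the answer for text columns can depend on your version of R: before R 4.0, read.csv() converted text columns to factors by default, while newer versions keep them as character.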
What percent of runners were male/female? Hint: try making a table and dividing the table by the total number of runners.
Make two different barplots, one to show the number of male vs. female runners, the other to show the percents.
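Here is a minimal sketch of the hint above, turning a table of counts into percents. It uses a made-up vector, toy_gender, which is hypothetical and much smaller than the race data.

```r
# Made-up genders for ten runners (hypothetical data).
toy_gender <- c("M", "F", "M", "M", "F", "M", "F", "M", "M", "F")

# Counts in each category.
gender_counts <- table(toy_gender)

# Dividing the table by the total number of observations gives proportions;
# multiplying by 100 turns them into percents.
gender_percents <- 100 * gender_counts / length(toy_gender)
gender_percents

# One barplot of counts and one of percents.
barplot(gender_counts, ylab = "Number of runners")
barplot(gender_percents, ylab = "Percent of runners")
```

For a data frame you would divide by nrow() of the data frame instead of length() of a vector.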
Try the command
boxplot(minutes ~ gender, data = results). What do you get?
Note: the tilde (~) operator is used in R to build a formula, which specifies
how variables are related to each other. You can read minutes ~ gender as
"minutes as a function of gender".
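To show what a formula does inside boxplot(), here is a sketch on a made-up data frame. The names toy and group are hypothetical. It also relies on the fact that boxplot() invisibly returns the statistics it plotted.

```r
# Made-up times for two groups of runners (hypothetical data).
toy <- data.frame(
  minutes = c(90, 95, 100, 105, 110, 120, 125, 130),
  group = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# The formula minutes ~ group means "minutes, split up by group".
# boxplot() draws one box per group and invisibly returns what it plotted.
bp <- boxplot(minutes ~ group, data = toy)

bp$names        # the group labels, in plotting order
bp$stats[3, ]   # the median of each group
```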
Plotting functions like hist, barplot,
and boxplot accept optional arguments to add features like
titles (main="Your Title"), labels
(xlab="", ylab=""), and color (col=""). Try to
add a title and labels to the previous plot and change the color to
“lightblue”.
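As an illustration of these optional arguments on a different plot, here is a sketch that decorates a histogram of made-up ages. The vector toy_age, the title, the labels, and the color choice are all hypothetical.

```r
# Made-up ages (hypothetical data).
toy_age <- c(23, 27, 29, 31, 32, 35, 38, 41, 44, 52, 58)

# main sets the title, xlab and ylab label the axes, and col sets the fill color.
hist(
  toy_age,
  main = "Ages of Runners (made-up data)",
  xlab = "Age (years)",
  ylab = "Number of runners",
  col = "lightgreen"
)
```

The same arguments work in barplot() and boxplot(). Running colors() lists the color names that R recognizes.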
Add another column to the results data frame with a
variable called speed that gives each runner’s average
speed (in miles per hour) for the race. (A half-marathon is 13.1 miles.)
Then use the summary() function to give a quick summary of
the speeds of the runners.
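Adding a column works the same way for any calculated variable. The sketch below adds a different one, pace in minutes per mile, to a made-up data frame. The names toy and pace are hypothetical, and the speed column itself is left for you to work out.

```r
# Made-up finishing times in minutes (hypothetical data).
toy <- data.frame(
  bib = c(1, 2, 3),
  minutes = c(78.6, 104.8, 131.0)
)

# Assigning to a new name with $ adds a column; the arithmetic is applied to every row.
# Here the new column is the pace in minutes per mile over 13.1 miles.
toy$pace <- toy$minutes / 13.1

toy
summary(toy$pace)
```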
What if we are only interested in the times of men? To get a subset
of a data frame, you can use the subset() function. Here is
an example:
men = subset(results, results$gender == 'M')
The first argument of the subset() function is a data
frame and the second argument is a logical expression. You can use the
operations ==, <, >,
<=, >= to create a logical expression.
You can also use the symbol & to combine two logical
expressions. The result is TRUE where both expressions are
true and FALSE otherwise. To get the logical or, which is TRUE
where at least one expression is true, use the symbol |.
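Here is a minimal sketch of & and | inside subset(), using a made-up data frame. The names toy, va_under_40, and nc_or_fast are hypothetical. Inside subset() you can refer to columns by name without the toy$ prefix.

```r
# A small made-up data frame (hypothetical, not the race data).
toy <- data.frame(
  age = c(19, 24, 37, 45, 61),
  state = c("VA", "NC", "VA", "VA", "NC"),
  minutes = c(95, 102, 88, 120, 131)
)

# & keeps rows where both conditions are TRUE.
va_under_40 <- subset(toy, state == "VA" & age < 40)

# | keeps rows where at least one condition is TRUE.
nc_or_fast <- subset(toy, state == "NC" | minutes < 90)

nrow(va_under_40)  # 2
nrow(nc_or_fast)   # 3
```

Because the result of subset() is itself a data frame, you can pass it to nrow(), summary(), or any of the plotting functions used earlier.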
Use this idea to answer the following additional questions.
How many men ran the half-marathon?
Make a data frame with information about the women who ran the half-marathon.
What was the average time for runners who were at least 50 years old?
How would you make a data frame for runners in their twenties only?
What about a data frame with any runner who is a woman or at least 50 years old?