Statistics Notes

Math 121 - Fall 2025

Jump to: Syllabus, Week 1, Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Week 8, Week 9, Week 10, Week 11, Week 12, Week 13, Week 14

Week 1 Notes

Day Section Topic
Mon, Aug 25 1.2 Data tables, variables, and individuals
Wed, Aug 27 2.1.3 Histograms & skew
Fri, Aug 29 2.1.5 Boxplots

Mon, Aug 25

Today we covered data tables, individuals, and variables. We also talked about the difference between categorical and quantitative variables.

  1. We looked at a case of a nurse who was accused of killing patients at the hospital where she worked for 18 months. One piece of evidence against her was that 40 patients died during the shifts when she worked, but only 34 died during shifts when she wasn’t working. If this evidence came from a data table, what would be the most natural individuals (rows) & variables (columns) for that table?
  2. In the data table in the example above, who or what are the individuals? What are the variables, and which are quantitative and which are categorical?

  3. If we want to compare states to see which are safer, why is it better to compare the rates instead of the total fatalities?

  4. What is wrong with this student’s answer to the previous question?

Rates are better because they are more precise and easier to understand.

I like this incorrect answer because it is a perfect example of bullshit. This student doesn’t know the answer, so they are trying to write something that sounds good and earns partial credit. Try to avoid writing bullshit. If you catch yourself writing B.S. on one of my quizzes or tests, then you can be sure that you are missing a really simple idea, and you should see if you can figure out what it is.

Wed, Aug 27

Today we did our first in-class workshop:

Before that, we talked about how to summarize data. We talked briefly about making bar charts for categorical data. Then we used the class data we collected last time to introduce histograms and stem-and-leaf plots (also known as stemplots).

We talked about how to tell if data is skewed left or skewed right. We also reviewed the mean and the median.

Median versus Average

The median of $N$ numbers is located at position $\dfrac{N+1}{2}$.

The median is not affected by skew, but the average is pulled in the direction of the skew. So the average will be bigger than the median when the data is skewed right, and smaller when the data is skewed left.
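
Here is a minimal sketch (made-up numbers, not class data) that checks this in Python:

```python
# Made-up, right-skewed numbers: one very large value stretches the right tail.
from statistics import mean, median

data = [30, 35, 40, 45, 50, 60, 250]
print(median(data))   # 45 -- the middle value, at position (7 + 1)/2 = 4 in the sorted list
print(mean(data))     # about 72.9 -- pulled above the median by the long right tail
```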

We finished by talking about these examples.

  1. Which is greater, the mean or the median household income?

  2. Can you think of a distribution that is skewed left?

  3. Why isn’t this bar graph from the book a histogram?

Until recently, Excel did not have an easy way to make histograms, but Google Sheets does. If you need to make a histogram, I recommend using Google Sheets or this histogram plotter tool.

Fri, Aug 29

We introduced the five-number summary and box-and-whisker plots (boxplots). We also talked about the interquartile range (IQR) and how to use the $1.5 \times \text{IQR}$ rule to determine whether a data value is an outlier.

We started with this simple example:

  1. An 8-man crew team actually includes 9 men, the 8 rowers and one coxswain. Suppose the weights (in pounds) of the 9 men on a team are as follows:

     120  180  185  200  210  210  215  215  215

    Find the 5-number summary and draw a box-and-whisker plot for this data. Is the coxswain who weighs 120 lbs. an outlier?
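
If you want to check your answer with software, here is a rough sketch using numpy (numpy’s default quartile rule interpolates, so its quartiles may differ slightly from the hand method in the book, but the conclusion is the same):

```python
import numpy as np

weights = [120, 180, 185, 200, 210, 210, 215, 215, 215]

q1, q2, q3 = np.percentile(weights, [25, 50, 75])
print(min(weights), q1, q2, q3, max(weights))   # the five-number summary

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr    # anything below this fence counts as an outlier
print(120 < lower_fence)        # True, so the 120 lb. coxswain is an outlier
```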


Week 2 Notes

Day Section Topic
Mon, Sep 1 Labor day - no class
Wed, Sep 3 2.1.4 Standard deviation
Fri, Sep 5 4.1 Normal distribution

Wed, Sep 3

Today we talked about robust statistics such as the median and IQR that are not affected by outliers and skew. We also introduced the standard deviation. We did this one example of a standard deviation calculation by hand, but you won’t ever have to do that again in this class.

  1. Eleven students just completed a nursing program. Here is the number of years it took each student to complete the program. Find the standard deviation of these numbers.

     3  3  3  3  4  4  4  4  5  5  6

From now on we will just use software to find standard deviation. In a spreadsheet (Excel or Google Sheets) you can use the =STDEV() function.
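
For example, here is a quick sketch of the same calculation in Python; statistics.stdev uses the same $n - 1$ formula as =STDEV():

```python
from statistics import stdev

years = [3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]
print(stdev(years))   # exactly 1.0 years for this data (sample standard deviation)
```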

  1. Which of the following data sets has the largest standard deviation?

    1. 1000, 998, 1005
    2. 8, 10, 15, 20, 22, 27
    3. 30, 60, 90

We finished by looking at some examples of histograms that have a shape that looks roughly like a bell. This is a very common pattern in nature that is called the normal distribution.

The normal distribution is a mathematical model for data with a histogram that is shaped like a bell. The model has the following features:

  1. It is symmetric (left & right tails are same size)
  2. The mean ($\mu$) is the same as the median.
  3. It has two inflection points (the two steepest points on the curve)
  4. The distance from the mean to either inflection point is the standard deviation ($\sigma$).
  5. The two numbers $\mu$ and $\sigma$ completely describe the model.

The normal distribution is a theoretical model that doesn’t have to perfectly match the data to be useful. We use the Greek letters $\mu$ and $\sigma$ for the theoretical mean and standard deviation of the normal distribution to distinguish them from the sample mean $\bar{x}$ and standard deviation $s$ of our data, which probably won’t follow the theoretical model perfectly.

Fri, Sep 5

We talked about z-values and the 68-95-99.7 rule.

We also did these exercises before the workshop.

  1. In 2020, Farmville got 61 inches of rain total (making 2020 the second wettest year on record). How many standard deviations is this above average?

  2. The average high temperature in Anchorage, AK in January is 21 degrees Fahrenheit, with standard deviation 10. The average high temperature in Honolulu, HI in January is 80°F with σ = 8°F. In which city would it be more unusual to have a high temperature of 57°F in January?
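
For reference, here is a sketch of the z-score arithmetic for exercise 2 (a z-score counts how many standard deviations a value is from the mean; all numbers come from the exercise):

```python
# z = (value - mean) / standard deviation
z_anchorage = (57 - 21) / 10    # 3.6, so 57°F is 3.6 standard deviations above average
z_honolulu = (57 - 80) / 8      # about -2.9, so 57°F is 2.9 standard deviations below average
# Since |3.6| > |-2.9|, a 57°F high in January would be more unusual in Anchorage.
```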


Week 3 Notes

Day Section Topic
Mon, Sep 8 4.1.5 68-95-99.7 rule
Wed, Sep 10 4.1.4 Normal distribution computations
Fri, Sep 12 2.1, 8.1 Scatterplots and correlation

Mon, Sep 8

We introduced how to find percentages on a normal distribution for locations that aren’t exactly 1, 2, or 3 standard deviations away from the mean. I strongly recommend downloading the Probability Distributions app (android version, iOS version) for your phone. We did the following examples.

  1. Finding the percentile from a location on a bell curve. SAT verbal scores are roughly normally distributed with mean μ = 500, and σ = 100. Estimate the percentile of a student with a 560 verbal score.

  2. Finding the percent between two locations. What percent of years will Farmville get between 40 and 50 inches of rain?

  3. Converting a percentile to a location. How much rain would Farmville get in a year that was in the 90th percentile?

We also talked about the shorthand notation $P(40 < X < 50)$, which literally means “the probability that the outcome $X$ is between 40 and 50”.

  1. What is the percentile of a man who is 6 feet tall (72 inches)?

  2. Estimate the percent of men who are between 6 feet and 6’5” tall.

  3. How tall are men in the 80th percentile?

  4. Men have a footprint length that is approximately $N(25 \text{ cm}, 4 \text{ cm})$. Women’s footprints are approximately $N(19 \text{ cm}, 3 \text{ cm})$. Find

    1. $P(\text{Man} > 22 \text{ cm})$
    2. $P(\text{Woman} > 22 \text{ cm})$
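
If you prefer a computer to the phone app, here is a rough sketch of the same kinds of calculations using scipy (scipy is not required for this class; the app works just as well):

```python
from scipy.stats import norm

# Percentile from a location: SAT verbal score of 560 when scores are N(500, 100).
print(norm.cdf(560, loc=500, scale=100))        # about 0.73, i.e., roughly the 73rd percentile

# Location from a percentile: norm.ppf goes the other way (percentile -> location).
print(norm.ppf(0.90, loc=500, scale=100))       # about 628, the 90th percentile of SAT scores

# Percent above a cutoff, using the footprint models in example 4 above:
print(1 - norm.cdf(22, loc=25, scale=4))        # P(Man > 22 cm), about 0.77
print(1 - norm.cdf(22, loc=19, scale=3))        # P(Woman > 22 cm), about 0.16
```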

Wed, Sep 10

We continued practicing calculations with the normal distribution.

Workshop: Normal distributions 2

After we finished that, we talked about explanatory & response variables (see section 1.2.4 in the book).

Fri, Sep 12

We introduced scatterplots and correlation coefficients with these examples:

  1. What would the correlation between husband and wife ages be in a country where every man married a woman exactly 10 years older? What if every man married a woman exactly half his age?

Important concept: correlation does not change if you change the units or apply a simple linear transformation to the axes. Correlation just measures the strength of the linear trend in the scatterplot.

Another thing to know about the correlation coefficient is that it only measures the strength of a linear trend. The correlation coefficient is not as useful when a scatterplot has a clearly visible nonlinear trend.
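
A quick sketch (with made-up ages and heights) that checks both facts using numpy:

```python
import numpy as np

husbands = np.array([25, 30, 40, 55, 62])
print(np.corrcoef(husbands, husbands + 10)[0, 1])   # wives exactly 10 years older: R = 1
print(np.corrcoef(husbands, husbands / 2)[0, 1])    # wives exactly half the age: still R = 1

heights_in = np.array([64, 67, 70, 72, 75])
weights_lb = np.array([120, 150, 155, 180, 200])
print(np.corrcoef(heights_in, weights_lb)[0, 1])           # some correlation R
print(np.corrcoef(heights_in * 2.54, weights_lb)[0, 1])    # same R after converting to cm
```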


Week 4 Notes

Day Section Topic
Mon, Sep 15 8.2 Least squares regression introduction
Wed, Sep 17 8.2 Least squares regression practice
Fri, Sep 19 1.3 Sampling: populations and samples

Mon, Sep 15

We talked about least squares regression. The least squares regression line has these features:

  1. Slope $m = R \frac{s_y}{s_x}$
  2. It passes through the point $(\bar{x}, \bar{y})$
  3. $y$-intercept $b = \bar{y} - m \bar{x}$

You won’t have to calculate the correlation $R$ or the standard deviations $s_y$ and $s_x$, but you might have to use them to find the formula for a regression line.

We looked at these examples:

Keep in mind that regression lines have two important applications: interpreting the slope as a rate of change and making predictions.

It is important to be able to describe the units of the slope.

  1. What are the units of the slope of the regression line for predicting BAC from the number of beers someone drinks?

  2. What are the units of the slope for predicting someone’s weight from their height?

We also introduced the following concepts.

The coefficient of determination $R^2$ represents the proportion of the variability of the $y$-values that follows the trend line. The remaining $1 - R^2$ represents the proportion of the variability that is above and below the trend line.

Regression to the mean. Extreme $x$-values tend to have less extreme predicted $y$-values in a least squares regression model.

Wed, Sep 17

Before the workshop, we started with these two exercises:

  1. Suppose that the correlation between the heights of fathers and adult sons is $R = 0.5$. Given that both fathers and sons have normally distributed heights with mean 70 inches and standard deviation 3 inches, find an equation for the least squares regression line.

  2. A study of a sample of 20 college students looked at the relationship between footprint length (cm) and height (in). The sample had the following statistics: $\bar{x} = 28.5 \text{ cm}$, $\bar{y} = 67.75 \text{ in}$, $s_x = 3.45 \text{ cm}$, $s_y = 5.0 \text{ in}$, $R = 0.71$.

    1. Find the slope of the regression line to predict height ($y$) based on footprint length ($x$). Include the units and briefly explain what it means.
    2. If a footprint was 30 cm long, how tall would you predict the subject was?
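
For reference, here is a sketch of exercise 2 worked out with Monday’s formulas (all numbers come from the exercise; this is not the only way to write it up):

```python
R, s_x, s_y = 0.71, 3.45, 5.0
x_bar, y_bar = 28.5, 67.75

m = R * s_y / s_x        # slope: about 1.03 inches of height per cm of footprint length
b = y_bar - m * x_bar    # y-intercept: about 38.4 inches
print(m, b)

print(b + m * 30)        # predicted height for a 30 cm footprint: about 69.3 inches
```

The same formulas give $m = 0.5 \cdot \frac{3}{3} = 0.5$ and $b = 70 - 0.5 \cdot 70 = 35$ in exercise 1. And since $R = 0.71$ in exercise 2, $R^2 \approx 0.50$, so about half of the variability in heights follows the trend line.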

Fri, Sep 19

We talked about the difference between samples and populations. The central problem of statistics is to use sample statistics to answer questions about population parameters.

We looked at an example of sampling from the Gettysburg address, and we talked about the central problem of statistics: How can you answer questions about the population using samples?

This is hard because sample statistics usually don’t match the true population parameters. There are two reasons why: random error and bias.

We looked at this case study:

Important Concepts

  1. Bigger samples have less random error.

  2. Bigger samples don’t reduce bias.

  3. The only sure way to avoid bias is a simple random sample.
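
Here is a small simulation sketch (made-up population, not something we did in class) that illustrates the first two points:

```python
import random

random.seed(1)
population = [1] * 50_000 + [0] * 50_000     # true proportion of 1's is 0.50
biased_frame = [1] * 60_000 + [0] * 40_000   # a bad sampling frame that over-represents 1's

for n in (100, 10_000):
    srs = random.sample(population, n)        # simple random sample from the population
    bad = random.sample(biased_frame, n)      # a "bigger sample" drawn from the biased frame
    print(n, sum(srs) / n, sum(bad) / n)
# The SRS proportions settle down near 0.50 as n grows (less random error),
# but the biased samples stay near 0.60 no matter how big n gets (bias doesn't shrink).
```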


Week 5 Notes

Day Section Topic
Mon, Sep 22 1.3 Bias versus random error
Wed, Sep 24 Review
Fri, Sep 26 Midterm 1

Mon, Sep 22

We did this workshop.

Wed, Sep 24

We talked about the midterm 1 review problems.


Week 6 Notes

Day Section Topic
Mon, Sep 29 1.4 Randomized controlled experiments
Wed, Oct 1 3.1 Defining probability
Fri, Oct 3 3.1 Multiplication and addition rules

Mon, Sep 29

One of the hardest problems in statistics is to prove causation. Here is a diagram that illustrates the problem.

The explanatory variable might be the cause of a change in the response variable. But we have to watch out for other variables that aren’t part of the study, called lurking variables.

A lurking variable that might be associated with both the explanatory and response variable is called a confounding variable.

We say that correlation is not causation because you can’t assume that there is a cause and effect relationship between two variables just because they are strongly associated. The association might be caused by lurking variables or the causal relationship might go in the opposite direction of what you expect.

Experiments versus Observational Studies

An experiment is a study where the individuals are placed into different treatment groups by the researchers. An observational study is one where the researchers do not place the individuals into different treatment groups.

A randomized controlled experiment is one where the individuals are randomly assigned to treatment groups.

Important concept: Random assignment automatically controls for all lurking variables, which lets you establish cause and effect.

We looked at these examples.

  1. A study tried to determine whether cellphones cause brain cancer. The researchers interviewed 469 brain cancer patients about their cellphone use between 1994 and 1998. They also interviewed 469 other hospital patients (without brain cancer) who had the same ages, genders, and races as the brain cancer patients.

    1. What was the explanatory variable?
    2. What was the response variable?
    3. Which variables were controlled?
    4. Was this an experiment or an observational study?
    5. Are there any possible lurking variables?
  2. In 1954, the polio vaccine trials were one of the largest randomized controlled experiments ever conducted. Here were the results.

    1. What was the explanatory variable?
    2. What was the response variable?
    3. This was an experiment because it had a treatment variable. What was that?
    4. Which variables were controlled?
    5. Why don’t we have to worry about lurking variables?

We talked about why the polio vaccine trials were double blind and what that means.

Here is one more example we didn’t have time for:

  1. Do magnetic bracelets work to help with arthritis pain?

    1. What is the explanatory variable?
    2. What is the response variable?
    3. How hard would it be to design a randomized controlled experiment to answer the question above?

We finished by talking about anecdotal evidence.

Wed, Oct 1

Today we introduced probability models which always have two parts:

  1. A list of possible outcomes called a sample space.
  2. A probability function $P(E)$ that gives the probability for any subset $E$ of the sample space.

A subset of the sample space is called an event. We already intuitively know lots of probability models. For example, we described the following:

  1. Flip a coin.

  2. Roll a six-sided die.

  3. If you roll a six-sided die, what is $P(\text{result at least 5})$?

  4. The proportion of people in the US with each of the four blood types is shown in the table below.

    Type        O     A     B     AB
    Proportion  0.45  0.40  0.11  ?

    What is $P(\text{Type AB})$?
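
For reference, one way to fill in the missing entry in example 4 (a hint, not the official in-class write-up) uses the fact that the probabilities in a model must add up to 1:

$P(\text{Type AB}) = 1 - (0.45 + 0.40 + 0.11) = 0.04.$

The same idea of adding up equally likely outcomes handles example 3: $P(\text{result at least 5}) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$.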

Fri, Oct 3

Today we talked about the multiplication and addition rules for probability. We also talked about independent events and conditional probability. We started with these examples.

  1. If you roll two six-sided dice, the results are independent. What is the probability that both dice land on a six?

  2. Suppose you shuffle a deck of 52 playing cards and then draw two cards from the top. Find

    1. $P(\text{First card is an ace})$
    2. $P(\text{Second card is an ace} \mid \text{first is an ace})$
    3. $P(\text{Second card is an ace} \mid \text{first is not an ace})$
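
For reference, here is one way to work these out (a sketch, not the official in-class solutions). For the dice, independence lets us multiply: $P(\text{both sixes}) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$. For the cards, $P(\text{First card is an ace}) = \frac{4}{52}$; once an ace has been drawn, only 3 aces remain among 51 cards, so $P(\text{Second card is an ace} \mid \text{first is an ace}) = \frac{3}{51}$, while $P(\text{Second card is an ace} \mid \text{first is not an ace}) = \frac{4}{51}$.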

Then we did this workshop:


Week 7 Notes

Day Section Topic
Mon, Oct 6 3.4 Weighted averages & expected value
Wed, Oct 8 3.4 Random variables
Fri, Oct 10 7.1 Sampling distributions

Mon, Oct 6

Today we talked about weighted averages. To find a weighted average:

  1. Multiply each number by its weight.
  2. Add the results.

We did this workshop.

Before that, we did some examples.

  1. Calculate the final grade of a student who gets an 80 quiz average, 72 midterm average, 95 project average, and an 89 on the final exam.

  2. Eleven nursing students graduated from a nursing program. Four students completed the program in 3 years, four took 4 years, two took 5 years, and one student took 6 years to graduate. Express the average time to complete the program as a weighted average.
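
For example, here is a quick check of example 2 using the two steps above (a sketch; the weights are the fractions of the 11 students in each group):

```python
years = [3, 4, 5, 6]
counts = [4, 4, 2, 1]
weights = [c / 11 for c in counts]                  # 4/11, 4/11, 2/11, 1/11 -- they add up to 1

avg = sum(w * y for w, y in zip(weights, years))    # step 1: multiply, step 2: add
print(avg)                                          # 4.0 years on average
```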

The expected value (also known as the theoretical average) is the weighted average of the outcomes in a probability model, using the probabilities as the weights.

  1. In roulette there is a wheel with 38 slots. There are 18 red slots, 18 black slots, and 2 green slots. When you spin the wheel, you can bet that the ball will land on a black slot. If you bet $1 and the ball lands on black, then you win $2; otherwise you win nothing. What is the expected value for this bet?

The Law of Large Numbers. When you repeat a random experiment many times, the sample mean $\bar{x}$ tends to get closer to the theoretical average $\mu$.
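
Here is a rough sketch (in Python, not something we did in class) that computes the expected value of the roulette bet and then simulates many bets to watch the Law of Large Numbers in action:

```python
import random

random.seed(0)
p_black = 18 / 38
mu = 2 * p_black + 0 * (1 - p_black)   # expected value: about $0.947 per $1 bet
print(mu)

# Simulate repeated $1 bets: the sample mean payout drifts toward mu as n grows.
for n in (100, 10_000, 1_000_000):
    payouts = [2 if random.random() < p_black else 0 for _ in range(n)]
    print(n, sum(payouts) / n)
```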


Week 8 Notes

Day Section Topic
Mon, Oct 13 Fall break - no class
Wed, Oct 15 5.1 Sampling distributions for proportions
Fri, Oct 17 5.2 Confidence intervals for a proportion

Week 9 Notes

Day Section Topic
Mon, Oct 20 5.2 Confidence intervals for a proportion - cont’d
Wed, Oct 22 Review
Fri, Oct 24 Midterm 2

Week 10 Notes

Day Section Topic
Mon, Oct 27 5.3 Hypothesis testing for a proportion
Wed, Oct 29 6.1 Inference for a single proportion
Fri, Oct 31 5.3.3 Decision errors

Week 11 Notes

Day Section Topic
Mon, Nov 3 6.2 Difference of two proportions (hypothesis tests)
Wed, Nov 5 6.2.3 Difference of two proportions (confidence intervals)
Fri, Nov 7 7.1 Introducing the t-distribution

Week 12 Notes

Day Section Topic
Mon, Nov 10 7.1.4 One sample t-confidence intervals
Wed, Nov 12 7.2 Paired data
Fri, Nov 14 7.3 Difference of two means

Week 13 Notes

Day Section Topic
Mon, Nov 17 7.3 Difference of two means
Wed, Nov 19 Review
Fri, Nov 21 Midterm 3
Mon, Nov 24 7.4 Statistical power

Week 14 Notes

Day Section Topic
Mon, Dec 1 6.3 Chi-squared statistic
Wed, Dec 3 6.4 Testing association with chi-squared
Fri, Dec 5 Choosing the right technique
Mon, Dec 8 Last day, recap & review