| Day | Section | Topic |
|---|---|---|
| Mon, Jan 12 | Working with R and RStudio | |
| Wed, Jan 14 | 1.3 | Sampling principles and strategies |
| Fri, Jan 16 | 1.4 | Experiments |
Today we went over the course syllabus and talked about making R-markdown files in RStudio. We started the following lab in class; I recommend finishing the second half on your own. I also recommend installing RStudio on your own laptop (it’s free).
Today we reviewed populations and samples. We started with a famous example of a bad sample.
Then we reviewed population parameters, sample statistics, and sampling frames. The difference between a sample statistic and a population parameter is called the sampling error. There are two sources of sampling error:

- Bias. This can be caused by a non-representative sample (sampling bias) or by measurement errors, non-response, or biased questions (non-sampling bias). The only way to avoid sampling bias is a simple random sample (SRS) from the whole population.
- Random error. This is non-systematic error. It tends to get smaller with larger samples.
We finished with this workshop.
If you find an association between an explanatory variable and a response variable in an observational study, then you can’t say for sure that the explanatory variable is the cause. We say that correlation is not causation because there might be lurking variables that are confounders, that is, variables associated with both the explanatory and response variables, so you can’t tell which is the true cause.
Randomized experiments, on the other hand, can establish cause and effect, because random assignment to treatment groups controls for lurking variables. We also talked about blocking and double-blind experiments.
Example: 1954 polio vaccine trials
Workshop: Experiments
We finished by simulating the results of the polio vaccine trials to see if they might just be a random fluke. We wrote this R code in class:
```r
results <- c()
trials <- 1000
for (x in 1:trials) {
  # Randomly assign each of the 244 polio cases to one of two groups (0 or 1).
  simulated.result <- sample(c(0, 1), size = 244, replace = TRUE)
  # Record the proportion of cases that landed in group 1.
  percent <- sum(simulated.result) / 244
  results <- c(results, percent)
}
hist(results)
# Proportion of simulations at least as extreme as the observed value 0.336.
sum(results < 0.336) / trials
```

| Day | Section | Topic |
|---|---|---|
| Mon, Jan 19 | Martin Luther King day - no class | |
| Wed, Jan 21 | 2.1 | Examining numerical data |
| Fri, Jan 23 | 3.2 | Conditional probability |
Today we did a lab about using R to visualize data.
You should be able to open this file in your browser, then hit CTRL-A and CTRL-C to select it and copy it so that you can paste it into RStudio as an R-markdown document.
We had a little trouble with R-markdown on the lab computers.
Last time we talked about how to visualize data with R. Here are two quick summaries of how to make plots in R:
After that, we started talking about probability. We reviewed some of the basic rules.
The notation $P(B \mid A)$ means “the probability of $B$ given that $A$ happened”. Two events $A$ and $B$ are independent if the probability of $B$ does not depend on whether or not $A$ happens. We did the following examples.
We also talked about tree diagrams (see subsection 3.2.7 from the book) and how to use them to compute probabilities.
Based on a study of women in the United States and Germany, there is a 0.8% chance that a woman in her forties has breast cancer. Mammograms are 90% accurate at detecting breast cancer if someone has it. They are also 93% accurate at not detecting cancer in people who don’t have it. If a woman in her forties tests positive for cancer on a mammogram screening, what is the probability that she actually has breast cancer?
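One way to check the answer is to apply Bayes’ rule directly. Here is a quick sketch in R, using the numbers from the problem above:

```r
prior <- 0.008        # P(cancer) for a woman in her forties
sensitivity <- 0.90   # P(positive | cancer)
specificity <- 0.93   # P(negative | no cancer)

# P(positive) by the law of total probability
p.positive <- prior * sensitivity + (1 - prior) * (1 - specificity)

# Bayes' rule: P(cancer | positive)
prior * sensitivity / p.positive  # about 0.094
```

So even after a positive mammogram, the chance of actually having breast cancer is only about 9%, because the disease is rare to begin with.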
5% of men are color blind, but only 0.25% of women are. Find $P(\text{man} \mid \text{color blind})$.
| Day | Section | Topic |
|---|---|---|
| Mon, Jan 26 | Class canceled (snow) | |
| Wed, Jan 28 | 4.1 | Normal distribution |
| Fri, Jan 30 | 3.4 | Random variables |
Class was canceled today because I had a doctor’s appointment. But I recommended that everyone watch the following video and then complete a workshop about the R functions `pnorm`, `qnorm`, and `rnorm`.
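If you missed the workshop, here is a minimal sketch of what those three functions do, using the standard normal distribution:

```r
# P(Z <= 1.96) for a standard normal random variable: about 0.975
pnorm(1.96)

# The inverse: which z-value has 97.5% of the area below it? About 1.96.
qnorm(0.975)

# Ten random draws from a normal distribution with mean 100 and sd 15
rnorm(10, mean = 100, sd = 15)
```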
Today we talked about random variables and probability distributions. We talked about some example probability distributions:
- Flip a coin until you get a tail. Let $X$ represent the number of flips needed. (geometric distribution)
- About 1 meteorite bigger than 1000 kg hits the Earth every year. The time $T$ until the next meteorite hits the Earth has probability density function $f(t) = e^{-t}$. (exponential distribution)
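Both examples are easy to simulate in R. A quick sketch (the rate of one meteorite per year comes from the example above):

```r
# Geometric: count flips until the first tail (p = 0.5 per flip).
# rgeom counts the failures before the first success, so add 1 for the final flip.
flips <- rgeom(10000, prob = 0.5) + 1
mean(flips)  # should be close to 2

# Exponential: waiting time (in years) until the next large meteorite,
# with a rate of 1 per year.
waits <- rexp(10000, rate = 1)
mean(waits)  # should be close to 1
```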
We talked about the difference between continuous and discrete probability distributions. Then we introduced expected value.
If $X$ is a discrete random variable, then the expected value of $X$ is $$E(X) = \sum_x x \, P(X = x).$$ If $X$ is a continuous random variable with probability density function $f(x)$, then the expected value of $X$ is $$E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx.$$
We did the following example.
We finished by talking about what we mean when we say something is “expected”.
If you repeat a random experiment many times, then the average outcome tends to get close to the expected value.
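You can see this in R by simulating many die rolls and watching the running average settle down toward the expected value of 3.5:

```r
rolls <- sample(1:6, size = 10000, replace = TRUE)
running.average <- cumsum(rolls) / (1:10000)
plot(running.average, type = "l")
abline(h = 3.5, col = "red")  # the expected value of a single roll
```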
| Day | Section | Topic |
|---|---|---|
| Mon, Feb 2 | 3.4 | Random variables - con’d |
| Wed, Feb 4 | 4.3 | Binomial distribution |
| Fri, Feb 6 | 5.1 | Point estimates and error |
For a random variable $X$ with expected value $\mu$, the variance of $X$ is $$\operatorname{Var}(X) = E\big((X - \mu)^2\big).$$ The standard deviation of $X$ (denoted $\sigma$) is the square root of the variance.
We did these examples in class.
Here is an extra example from Khan Academy that we did not do in class.
Suppose a random variable $X$ has the following probability model.

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $P(X = x)$ | 0.1 | 0.15 | 0.4 | 0.25 | 0.1 |
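A sketch of how you might compute the expected value and standard deviation of this model in R:

```r
x <- c(0, 1, 2, 3, 4)
p <- c(0.1, 0.15, 0.4, 0.25, 0.1)

mu <- sum(x * p)               # expected value: 2.1
sigma2 <- sum((x - mu)^2 * p)  # variance: 1.19
sqrt(sigma2)                   # standard deviation: about 1.09
```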
Expected value is linear, which means that for any two random variables $X$ and $Y$ and any constant $c$, these two properties hold: $$E(X + Y) = E(X) + E(Y) \qquad \text{and} \qquad E(cX) = c \, E(X).$$
Variance is not linear. Instead it has these properties: $$\operatorname{Var}(cX) = c^2 \operatorname{Var}(X), \qquad \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) \ \text{ if $X$ and $Y$ are independent.}$$
A single six-sided die has expected value $3.5$ and standard deviation $\sqrt{35/12} \approx 1.71$. What are the mean and standard deviation if you roll two dice and add them?
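By the rules above, the mean of the sum is $3.5 + 3.5 = 7$, and since the two dice are independent, the variance is $35/12 + 35/12 = 35/6$, so the standard deviation is $\sqrt{35/6} \approx 2.42$. A quick simulation to check:

```r
two.dice <- replicate(10000, sum(sample(1:6, size = 2, replace = TRUE)))
mean(two.dice)  # close to 7
sd(two.dice)    # close to 2.42
```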
Binomial distribution. If $X$ is the total number of successes in $n$ independent trials, each with probability $p$ of a success, then $X$ has a binomial distribution, denoted $X \sim \operatorname{Binomial}(n, p)$ for short. This distribution has mean $np$ and standard deviation $\sqrt{np(1-p)}$.
We used this binomial distribution plotting tool to compare the distributions if you make these two bets 100 times. In one case we get something that looks roughly like a bell curve, in the other case we get something that is definitely skewed to the right.
We also looked at the `pbinom(x, n, p)` function in R, which computes $P(X \le x)$ for a $\operatorname{Binomial}(n, p)$ random variable.

Sometimes the assumption that the trials are independent is not justified.
The correct probability distribution to model the example above is called the hypergeometric distribution. As long as the population is much larger than the sample, we typically do not need to worry about the trials not being independent.
We finished by discussing the normal approximation of a binomial distribution. When $n$ is large enough so that both $np \ge 10$ and $n(1-p) \ge 10$, then $X$ is approximately normal with mean $np$ and standard deviation $\sqrt{np(1-p)}$.
Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables that all have the same probability distribution. If $n$ is large, then the total $X_1 + X_2 + \cdots + X_n$ has an approximately normal distribution.

If each $X_i$ has mean $\mu$ and standard deviation $\sigma$, then what are the mean and the standard deviation of the total?

In Dungeons and Dragons, you calculate the damage from a fireball spell by rolling 8 six-sided dice and adding up the results. This has an approximately normal distribution. What are the mean and standard deviation of this distribution? (Recall that the mean and standard deviation of a single six-sided die are $3.5$ and $\sqrt{35/12} \approx 1.71$.)
We looked at a graph of the distribution from the previous example to see that it is indeed approximately normal.
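Here is a sketch of how you could reproduce that graph in R. The mean of the total is $8 \times 3.5 = 28$ and the standard deviation is $\sqrt{8 \cdot 35/12} \approx 4.83$:

```r
damage <- replicate(10000, sum(sample(1:6, size = 8, replace = TRUE)))
hist(damage, breaks = 30, freq = FALSE)

# Overlay the normal approximation with mean 28 and sd about 4.83.
curve(dnorm(x, mean = 28, sd = sqrt(8 * 35 / 12)), add = TRUE, col = "red")
```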
When you use a normal approximation to estimate discrete probabilities, it is recommended to use a continuity correction (see Section 4.3.3). To estimate $P(X \le x)$, calculate $P(X \le x + 0.5)$ using the normal approximation (and likewise, to estimate $P(X \ge x)$, compute $P(X \ge x - 0.5)$ using the normal approximation).
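For example, if $X \sim \operatorname{Binomial}(100, 0.5)$ (numbers chosen just for illustration), you can compare the exact value of $P(X \le 45)$ with the continuity-corrected normal approximation:

```r
# Exact binomial probability P(X <= 45) when n = 100, p = 0.5
pbinom(45, size = 100, prob = 0.5)

# Normal approximation with continuity correction:
# mean np = 50, sd sqrt(np(1-p)) = 5, and we use 45.5 instead of 45.
pnorm(45.5, mean = 50, sd = 5)
```

Both give about 0.184, so the approximation with the correction is quite good.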
An important special case of the central limit theorem is the normal approximation of the binomial distribution, which has mean $np$ and standard deviation $\sqrt{np(1-p)}$.
We checked our answers with the `pbinom(x, n, p)` function. We finished by talking about the difference between the distribution of the total versus the distribution of the proportion of patients who are O-negative. The standard deviation of the sample proportion is $$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}.$$
| Day | Section | Topic |
|---|---|---|
| Mon, Feb 9 | 5.2 | Confidence intervals for a proportion |
| Wed, Feb 11 | Review | |
| Fri, Feb 13 | Midterm 1 | |
Today we talked about confidence intervals for a proportion.
Sampling Distribution for a Sample Proportion. In an SRS of size $n$ from a large population, the sample proportion $\hat{p}$ is random, so it has a probability distribution with the following features: its mean is the population proportion $p$, its standard deviation is $\sqrt{p(1-p)/n}$, and its shape is approximately normal when $n$ is large enough.
In practice, we usually don’t know the population proportion $p$. Instead we can use the sample proportion $\hat{p}$ to calculate the standard error of $\hat{p}$: $$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
If the sample size is large enough, then there is a 95% chance that $\hat{p}$ will be within about two standard deviations of $p$. So if we know $\hat{p}$ and we assume that the standard error is close to the standard deviation of $\hat{p}$, then we can make a confidence interval for the location of the parameter $p$.
Confidence Interval for a Proportion. $$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$ This works well if the sample size is very large.
You can use the R command `qnorm((1 - p) / 2)` to find the critical z-value $z^*$ when you want a specific confidence level $p$. (Note that this command returns the lower cutoff $-z^*$, so take its absolute value.)
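Putting the pieces together, here is a sketch with made-up data (83 successes in a sample of 200, chosen just for illustration):

```r
n <- 200
p.hat <- 83 / n

# Standard error of the sample proportion
se <- sqrt(p.hat * (1 - p.hat) / n)

# Critical z-value for a 95% confidence level: about 1.96
z.star <- abs(qnorm((1 - 0.95) / 2))

# 95% confidence interval for p
c(p.hat - z.star * se, p.hat + z.star * se)
```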
After that, we talked about the `prop.test()` function in R, which can make a confidence interval (among other things).

Notice that the `prop.test()` confidence interval is not the same as what we got using the formula above. Instead of using the formula above, R uses something called a Wilson score confidence interval with continuity correction. The idea is to solve for the two points $p$ where $$\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} = \pm z^*.$$ If you add in the continuity correction, this pretty much guarantees that there is at least a 95% chance (or whatever other confidence level you want) that the interval contains the true population parameter. The Wilson method confidence intervals are fairly trustworthy even with relatively small samples and small numbers of successes/failures.
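A sketch of `prop.test()` on the same made-up data as above (83 successes out of 200):

```r
# Wilson score interval with continuity correction (the default)
prop.test(83, 200, conf.level = 0.95)
```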
Today we went over the midterm 1 review problems (the solutions are also available now). We also did some additional practice problems including these.
If you draw a random card from a deck of 52 playing cards, what is the probability that you draw an ace or a heart?
Suppose you need knee surgery. There is an 11% chance that the surgery fails. There is a 4% chance of getting an infection. And there is a 3% chance of both infection and the surgery failing. What is the probability that the surgery succeeds without infection?
In the Wimbledon tennis tournament, serving players are more likely to win a point. A server has two chances to serve the ball. There is a 59% chance that the first serve is in, and if it is, then the server has a 73% chance of winning the point. If the first serve is out, then they have an 86% chance of getting the second serve in, and in that case they have a 59% chance of winning the point. But if the second serve is out, then the server automatically loses the point. What is the probability that the server wins the point?
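Here is one way to check all three answers in R, assuming the natural reading of each question:

```r
# Ace or heart: 4 aces + 13 hearts - 1 ace of hearts
(4 + 13 - 1) / 52  # 16/52, about 0.308

# Surgery succeeds without infection: the complement of
# P(fail or infection) = 0.11 + 0.04 - 0.03 = 0.12
1 - (0.11 + 0.04 - 0.03)  # 0.88

# Server wins the point: first serve in and win, or
# first serve out, second serve in, and win
0.59 * 0.73 + (1 - 0.59) * 0.86 * 0.59  # about 0.64
```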
| Day | Section | Topic |
|---|---|---|
| Mon, Feb 16 | 5.3 | Hypothesis tests for a proportion |
| Wed, Feb 18 | 6.2 | Difference in two proportions |
| Fri, Feb 20 | 6.2 | Difference in two proportions - con’d |
| Day | Section | Topic |
|---|---|---|
| Mon, Feb 23 | 6.3 | Chi-squared goodness of fit test |
| Wed, Feb 25 | 6.4 | Chi-squared test for association |
| Fri, Feb 27 | 7.1 | One-sample means with t-distribution |
| Day | Section | Topic |
|---|---|---|
| Mon, Mar 2 | 7.2 | Paired data |
| Wed, Mar 4 | 7.3 | Difference of two means |
| Fri, Mar 6 | 7.4 | Power calculations |
| Day | Section | Topic |
|---|---|---|
| Mon, Mar 16 | 7.5 | Comparing many means with ANOVA |
| Wed, Mar 18 | Review | |
| Fri, Mar 20 | Midterm 2 | |
| Day | Section | Topic |
|---|---|---|
| Mon, Mar 23 | 7.5 | ANOVA - con’d |
| Wed, Mar 25 | 8.2 | Least squares regression |
| Fri, Mar 27 | 9.1 | Introduction to multiple regression |
| Day | Section | Topic |
|---|---|---|
| Mon, Mar 30 | 9.2 | Model selection |
| Wed, Apr 1 | 9.3 | Checking model conditions |
| Fri, Apr 3 | 9.3 | Checking model conditions - con’d |
| Day | Section | Topic |
|---|---|---|
| Mon, Apr 6 | 9.5 | Introduction to logistic regression |
| Wed, Apr 8 | 9.5 | Logistic regression - con’d |
| Fri, Apr 10 | Hypothesis testing with randomization | |
| Day | Section | Topic |
|---|---|---|
| Mon, Apr 13 | Confidence intervals with bootstrapping | |
| Wed, Apr 15 | Review | |
| Fri, Apr 17 | Midterm 3 | |
| Day | Section | Topic |
|---|---|---|
| Mon, Apr 20 | Introduction to Bayesian methods | |
| Wed, Apr 22 | Credible intervals for proportions | |
| Fri, Apr 24 | Bayesian inference | |
| Mon, Apr 27 | Last day, recap & review | |