prop_red <- as.data.frame(prop_red)

ggplot(prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 50 balls that were red",
       title = "Distribution of proportions red")
ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  geom_vline(xintercept = .37, colour = "green", linetype = "longdash") +
  labs(x = "Proportion of 50 balls that were red",
       title = "Distribution of 10,000 proportions red") +
  theme_minimal(base_size = 16)
Key points
Two samples from the same population will tend to have somewhat different means
Conversely, the fact that two sample means differ does NOT mean that they come from different populations
The variance of the sampling distribution of means gets smaller as the sample size increases
More samples give a better estimate of the population mean
Sampling Distribution
The probability distribution of a given statistic (e.g., mean) taken from a random sample
The distribution of a statistic (e.g., the mean) that would be produced by infinitely repeated random sampling (with replacement), in theory
IMPORTANT: Can be any statistic (proportion, mean, standard deviation)
Constructing Sampling Distribution
Randomly draw n sample points from a finite population with size N
Compute statistic of interest
List different observed values of the statistic with their corresponding frequencies
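The three steps above can be sketched in a short simulation. This is a minimal illustration, not the course's own example: the population here is a hypothetical set of 10,000 normally distributed scores, and the sample size n = 25 is chosen arbitrarily.

```r
# Sketch: building a sampling distribution of the mean
# (assumed toy population; mean = 50, sd = 10 are arbitrary)
set.seed(42)
population <- rnorm(10000, mean = 50, sd = 10)

# Steps 1-2: randomly draw n = 25 points, compute the statistic; repeat many times
sample_means <- replicate(5000, mean(sample(population, size = 25)))

# Step 3: the collected means approximate the sampling distribution
mean(sample_means)  # close to the population mean (unbiasedness)
sd(sample_means)    # close to sigma / sqrt(n)
```

With enough repetitions, the histogram of `sample_means` is the (approximate) sampling distribution described above.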
Properties of Estimators
Unbiased
The mean of the estimator's sampling distribution equals the parameter value
Efficient
Has smaller sampling variance (is more precise) than other unbiased estimators
Consistency
Sampling distribution becomes narrower if we increase sample size
Example: Sampling Distribution of the Mean
Scores on a statistics test
Taken from Andy Field’s “Adventures in Statistics”
Sampling Error (Standard Error)
Say it with me: The standard deviation of the sampling distribution is the standard error
It tells us how much variability there is between sample estimates and the population parameter
\[SEM = \sigma/\sqrt{N}\]
SEM
What does smaller SEM tell us about our estimate?
Estimate is likely to be closer to population mean
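The SEM formula can be checked numerically. A quick sketch, with an assumed population SD of 10 (an arbitrary value for illustration):

```r
# Sketch: SEM = sigma / sqrt(N), with an assumed sigma of 10
sigma <- 10
N <- c(10, 40, 160, 640)
sem <- sigma / sqrt(N)
round(sem, 2)
```

Note that each quadrupling of N only halves the SEM: precision improves with the square root of the sample size, not linearly.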
Sampling Distributions
Note
Sampling distributions are theoretical: the researcher does not actually select an infinite number of samples. Instead, they conduct repeated sampling from a larger population and use the central limit theorem to build the sampling distribution
A Tale of a Theorem and a Law: Magic
Central Limit Theorem
Properties:
The distribution of the sample mean approaches the normal distribution as the sample size increases
The standard deviation of the sampling distribution will be equal to the SD of the population divided by the square root of the sample size.
\[SE = \sigma/\sqrt{n}\]
Important: the CLT is a statement about the shape of the sampling distribution
Central Limit Theorem
Why is the CLT so important to the study of sampling distribution?
Kicks in regardless of the distribution of the random variable
We can use the normal distribution to tell us how far off our own sample mean is from all other possible means, and use this to inform decisions and estimates in null hypothesis statistical testing
Central Limit Theorem
Certain conditions must be met for the CLT to apply:
Independence: Sampled observations must be independent. This is difficult to verify, but is more likely if
Random sampling / assignment is used
Sample size / skew: Either the population distribution is normal, or if the population distribution is skewed, the sample size is large (> 30)
The more skewed the population distribution, the larger sample size we need for the CLT to apply
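The point above can be seen in a small simulation. This sketch uses a heavily right-skewed exponential population (rate 1, an assumed value) and compares sampling distributions of the mean for small versus larger samples:

```r
# Sketch: CLT with a skewed population (exponential with assumed rate = 1)
set.seed(1)
draw_means <- function(n, reps = 5000) replicate(reps, mean(rexp(n)))
means_n5  <- draw_means(5)
means_n50 <- draw_means(50)

# Simple moment-based skewness: larger n -> means are less skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(means_n5)   # still noticeably right-skewed
skewness(means_n50)  # much closer to 0 (approximately normal)
```

Even though the population is far from normal, the distribution of means at n = 50 is already close to normal, which is exactly why the "large sample" condition matters more for skewed populations.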
Real Data
data("gapminder", package = "dslabs")

gapminder_2015 <- gapminder %>%
  filter(year == 2015, !is.na(infant_mortality))

ggplot(gapminder_2015) +
  geom_histogram(aes(x = infant_mortality), color = "black") +
  xlab("Infant mortality per 1,000 live births") +
  ylab("Number of countries")
CI dance: each interval is one from the dance (it most likely captures the parameter we are estimating, but it might not)
General interpretation: if I collected 100 samples in an identical way, and for each of them calculated the mean and 95% CI, then we would expect 95% of those intervals to contain the true mean
CI Interpretations
2. Precision: the margin of error (MoE) gives us a sense of precision
Prediction: CIs give useful information about replication
E.g., if a replication study is done, there is an 83% chance (about 5 in 6) that the original 95% CI will capture the next sample mean
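The long-run interpretation above can be verified by simulation. A sketch, assuming a normal population with arbitrary parameters (mu = 100, sigma = 15) and t-based intervals:

```r
# Sketch: long-run coverage of 95% CIs (assumed normal population)
set.seed(7)
mu <- 100; sigma <- 15; n <- 30
covered <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]   # did this interval capture the true mean?
})
mean(covered)  # close to 0.95
```

Roughly 95% of the simulated intervals capture the true mean, matching the frequentist interpretation: the 95% refers to the procedure, not to any single interval.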
CIs and P-values
Note
We can use CIs to test the significance of an effect: if a 95% CI does not include 0, we can usually say p < .05
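This duality is easy to see with `t.test()`, which reports both the 95% CI and the p-value for the same test. A sketch on simulated data (the effect size 0.6 and n = 40 are assumed values):

```r
# Sketch: the 95% CI from t.test() excludes 0 exactly when p < .05
set.seed(3)
x  <- rnorm(40, mean = 0.6)  # simulated data with an assumed true effect
tt <- t.test(x, mu = 0)

excludes_zero <- tt$conf.int[1] > 0 || tt$conf.int[2] < 0
excludes_zero == (tt$p.value < 0.05)  # TRUE: the two criteria agree
```

The agreement is exact here because the CI and the p-value come from the same t-test; it is only approximate when the interval and the test are built from different procedures.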