The model for data we could observe, but haven’t yet
\(Y_1\)
The model for person 1
The yet-to-be-observed score of person 1
Building a Model - Notation
Greek letters
Population parameters
Unobservable parameters
μ
mu
“mew” - Used to describe means
σ
sigma
Used to describe a standard deviation
Building a Model - Notation
Roman letters
Sample specific statistics
\(\bar{X}\) - sample mean
s - standard deviation from the sample
Data estimates
\(b_0\)
A simple model
Null or empty model
\[
Y_i = \beta_0 + \epsilon_i
\]
\[
Y_i = b_0 + e_i
\]
Makes the same prediction for each observation
A Simple Model: Data
Scores
101
114
131
9
Figuring out \(b_0\)
The goal of any model is to find an estimator that minimizes the error
How we define error will determine the best estimator
Types of Errors
Sum of errors (residuals)
The sum of the differences between observed values and predicted values. In an ideal case with no bias, this would be zero.
\[SE = \sum_{i=1}^{n} (y_i - \hat{y}_i)\]
Sum of absolute errors
Measures the total absolute difference between observed and predicted values. It reflects the overall magnitude of the errors without considering their direction.
\[SAE = \sum_{i=1}^{n} |y_i - \hat{y}_i|\]
The median is the best estimate of \(b_0\) when error is defined this way
Sum of squares (SS)
This measures the total squared difference between observed and predicted values
Most commonly used in regression analysis (what we will be using)
\[SS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
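To make the three definitions of error concrete, here is a minimal sketch in Python that computes each one for the scores shown earlier. The choice of Python is an assumption (the course itself may use R), and the guess `b0_guess = 100` is a hypothetical value picked only for illustration.

```python
# Scores from the "A Simple Model: Data" slide.
scores = [101, 114, 131, 9]

# Hypothetical guess for b_0, chosen only for illustration.
b0_guess = 100

# The empty model predicts the same value for every observation,
# so each residual is just the score minus the guess.
residuals = [y - b0_guess for y in scores]

sum_of_errors = sum(residuals)                    # SE: signed errors can cancel
sum_abs_errors = sum(abs(e) for e in residuals)   # SAE: direction ignored
sum_squares = sum(e ** 2 for e in residuals)      # SS: large errors weigh more

print(sum_of_errors, sum_abs_errors, sum_squares)  # -45 137 9439
```

Changing `b0_guess` and rerunning shows how each measure responds to a different estimate.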
The mean
The mean is the best estimator of \(b_0\) when error is defined as the sum of squares
\[\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\]
The mean has really nice properties: when it is used as the prediction, the errors sum to zero. \[SE = \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0\]
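As a rough numerical check, the sketch below compares the mean and the median as estimates of \(b_0\) for the scores shown earlier. It is written in Python as an assumption about tooling, and the helper names (`sum_errors`, `sum_abs`, `sum_squares`) are hypothetical.

```python
from statistics import mean, median

# Scores from the "A Simple Model: Data" slide.
scores = [101, 114, 131, 9]

def sum_errors(b0):
    """Sum of signed errors when every observation is predicted as b0."""
    return sum(y - b0 for y in scores)

def sum_abs(b0):
    """Sum of absolute errors for the constant prediction b0."""
    return sum(abs(y - b0) for y in scores)

def sum_squares(b0):
    """Sum of squared errors for the constant prediction b0."""
    return sum((y - b0) ** 2 for y in scores)

mean_b0 = mean(scores)      # 88.75
median_b0 = median(scores)  # 107.5

print("SE at mean:", sum_errors(mean_b0))
print("SS at mean vs median:", sum_squares(mean_b0), sum_squares(median_b0))
print("SAE at mean vs median:", sum_abs(mean_b0), sum_abs(median_b0))
```

For these four scores the errors around the mean sum to zero, the mean gives the smaller sum of squares, and the median gives the smaller sum of absolute errors.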
SSR minimized at mean
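One way to see why, sketched with the empty model where every prediction is \(\hat{y}_i = b_0\): write the sum of squares as a function of \(b_0\), take the derivative, and set it to zero.

\[SS(b_0) = \sum_{i=1}^{n} (y_i - b_0)^2\]

\[\frac{d\,SS}{d\,b_0} = -2 \sum_{i=1}^{n} (y_i - b_0) = 0 \quad\Rightarrow\quad b_0 = \frac{1}{n} \sum_{i=1}^{n} y_i\]

The second derivative is \(2n > 0\), so this point is a minimum. The same condition says that the errors around the mean sum to zero.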
Describing error
We should have some overall description of the accuracy of the model’s predictions
Suppose that BMI measures for men age 60 in a Heart Study population are normally distributed with mean (μ) = 29 and standard deviation (σ) = 6. You are asked to compute the probability that a 60-year-old man in this population will have a BMI less than 30.
What is the z-score?
What is the probability that a 60-year-old man in this population will have a BMI between 30 and 40?
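One way to check answers to these questions is sketched below using `statistics.NormalDist` from Python's standard library. Using Python is an assumption (in R, `pnorm` plays the role of `cdf` here), and the variable names are hypothetical.

```python
from statistics import NormalDist

# BMI distribution stated in the problem: mean 29, standard deviation 6.
bmi = NormalDist(mu=29, sigma=6)

# z-score for a BMI of 30: distance from the mean in standard deviations.
z_30 = (30 - bmi.mean) / bmi.stdev

# P(BMI < 30) is the cumulative probability up to 30.
p_below_30 = bmi.cdf(30)

# P(30 < BMI < 40) is the difference of two cumulative probabilities.
p_30_to_40 = bmi.cdf(40) - bmi.cdf(30)

print(round(z_30, 3), round(p_below_30, 3), round(p_30_to_40, 3))
```

The z-score is about 0.17, the probability of a BMI below 30 is roughly 0.57, and the probability of a BMI between 30 and 40 is roughly 0.40.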
Practice: qnorm
Suppose that SAT scores are normally distributed, and that the mean SAT score is 1000, and the standard deviation of all SAT scores is 100. How high must you score so that only 10% of the population scores higher than you?
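A possible way to check this answer is sketched below. It uses `NormalDist.inv_cdf` from Python's standard library as a stand-in for R's `qnorm`; the choice of Python and the variable names are assumptions. If only 10% of the population scores higher, the cutoff is the 90th percentile.

```python
from statistics import NormalDist

# SAT distribution stated in the problem: mean 1000, standard deviation 100.
sat = NormalDist(mu=1000, sigma=100)

# The score with 90% of the population below it (10% above it).
cutoff = sat.inv_cdf(0.90)

print(round(cutoff, 1))
```

The cutoff comes out at about 1128, which is roughly 1.28 standard deviations above the mean.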