Monsters, Models, and Normal Distributions

Princeton University

Author

Jason Geller, Ph.D. (he/him)

Published

October 1, 2023

Packages


Outline

  • Thinking about models

  • The normal distribution

  • Z-scores

    • How to compute Z-scores
      • Z-score practice

What is a statistical model?




  • Statistical modeling = “making models of distributions”

What is a model?

Models are simplifications of things in the real world

What is a model?

Distributions


Basic Structure of a Model

\[data = model + error\]

  1. Data

  2. Model

    • Use our model to predict the value of the data for any given observation:

      \[\widehat{data_i} = model_i\]

  3. Error (observed - predicted)

The “hat” over the data denotes that it’s our prediction rather than the actual value of the data. This means that the predicted value of the data for observation i is equal to the value of the model for that observation. Once we have a prediction from the model, we can then compute the error:

\[error_i = data_i - \widehat{data_i}\]

That is, the error for any observation is the difference between the observed value of the data and the predicted value of the data from the model.
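As a minimal sketch of data = model + error, assume a handful of made-up scores and a model that predicts the same value (the mean) for everyone. The numbers and variable names are illustrative only.

```r
# Hypothetical scores for five people (made-up data)
scores <- c(90, 105, 98, 112, 95)

# A very simple model: predict the mean for every observation
scores_hat <- rep(mean(scores), length(scores))

# Error = observed - predicted
error <- scores - scores_hat
error

# The pieces add back up to the data: data = model + error
all.equal(scores, scores_hat + error)
```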

Models as Monsters

  • The Golem of Prague
    • The golem was a powerful clay robot
    • Brought to life by writing emet (“truth”) on its forehead
    • Obeyed commands literally
    • Powerful, but no wisdom
    • In some versions, Rabbi Judah Loew ben Bezalel built a golem to protect the Jews of Prague
      • But he lost control, causing innocent deaths

Statistical golems

  • Statistical (and scientific) models are our golems
    • We build them from basic parts
    • They are powerful—we can use them to understand the world and make predictions
    • They are animated by “truth” (data), but they themselves are neither true nor false
    • The model describes the golem, not the world
      • The model doesn’t describe the world or tell us what scientific conclusion to draw—that’s on us
    • We need to be careful about how we build, interpret, and apply models

Choosing a Statistical Model

  • Cookbook approach
    • Do Smarties make us smarties?
      • Take 200 7-year-olds
        • Randomly assign to 2 groups
          • Control: Normal breakfast
          • Treatment: Normal breakfast + 1 packet of Smarties
          • Outcome: Age-appropriate general reasoning test
    • What statistical analysis do I run?

Choosing a Statistical Model

  • Cookbook approach

    • Every one of these tests is the same model

    • The general linear model (GLM)

  • The cookbook approach makes it hard to think clearly about the relationship between our question and the statistics

The General Linear Model

  • General mathematical framework

    • Regression all the way down

    • Highly flexible

      • Can fit qualitative (categorical) and quantitative predictors
    • Easy to interpret

    • Helps us understand how it relates to other models

    • Easy to extend to more complex models

The General Linear Model

  • Model comparison approach

  • Think in terms of models and not tests

  • Model is determined by question, not data

  • What do alternative models say about the world?

  • Let’s build a model for this experiment

A simple model

  • General reasoning scores

A Simple Model: Data


Building a Model - Notation

  • Small Roman letters

    • Individual observed data points

      • \(y_1\), \(y_2\), \(y_3\), \(y_4\), …, \(y_n\)

        • The scores for person 1, person 2, person 3, etc.
      • \(y_i\)

        • The score for the “ith” person
  • Big Roman letters

    • A “random variable”

    • The model for data we could observe, but haven’t yet

    • \(Y_1\)

      • The model for person 1
      • The yet-to-be-observed score of person 1

Building a Model - Notation

  • Greek letters

    • Population parameters

    • Unobservable parameters

  • μ

    • mu (“mew”)

    • Used to describe means
  • σ

    • sigma

    • Used to describe a standard deviation

Building a Model - Notation

  • Roman letters

    • Sample specific statistics

      • \(\bar{X}\) - sample mean

      • s - standard deviation from the sample

    • Data estimates

      • \(b_0\) - estimate of \(\beta_0\)

      • \(e\) - estimate of \(\epsilon\) (the residual)

A simple model

  • Null or empty model

\[ Y_i = \beta_0 + \epsilon_i \]

\[ Y_i = b_0 + e_i \]

  • Makes the same prediction for each observation

Figuring out \(b_0\)

  • Goal of any model is to find an estimator that minimizes the error
    • How we define error will determine the best estimator

Types of Errors

Count of Errors

  • It simply counts the number of instances where the prediction was incorrect

\[\text{Count} = \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)\]

  • Mode is best estimate for \(b_0\)

Sum of errors (residuals)

  • The sum of the differences between observed values and predicted values. In an ideal case with no bias, this would be zero.

\[SE = \sum_{i=1}^{n} (y_i - \hat{y}_i)\]

Sum of absolute errors

  • Measures the total absolute difference between observed and predicted values. It gives a sense of the overall magnitude of the errors, ignoring their direction

\[SAE = \sum_{i=1}^{n} |y_i - \hat{y}_i|\]

  • Median is best estimate for \(b_0\)

Sum of squares (SS)

  • This measures the total squared difference between observed and predicted values

  • Most commonly used in regression analysis (what we will be using)

\[SS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

The mean

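  • Mean is the best estimator of \(b_0\) when error is defined as the sum of squares

    \[\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\]

  • Mean has really nice properties

    • Sum of residuals (SR) is zero at the mean

      \[SR = \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0\]

    • SSR minimized at mean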

SSR minimized at mean
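One way to check which estimator goes with which error definition is a numerical search. The sketch below uses made-up scores and base R's optimize(); the data and the helper names ss and sae are assumptions for illustration.

```r
# Hypothetical scores (made-up data with one large value)
y <- c(2, 3, 3, 4, 5, 7, 18)

# Two ways of defining total error for a candidate value of b0
ss  <- function(b0) sum((y - b0)^2)   # sum of squares
sae <- function(b0) sum(abs(y - b0))  # sum of absolute errors

# Search numerically for the b0 that minimizes each error
best_ss  <- optimize(ss,  interval = range(y))$minimum
best_sae <- optimize(sae, interval = range(y))$minimum

c(best_ss = best_ss, mean = mean(y))         # both close to the mean
c(best_sae = best_sae, median = median(y))   # both close to the median

# The sum of (signed) residuals around the mean is zero
sum(y - mean(y))
```

The large value (18) pulls the mean well above the median, which shows how the choice of error definition changes the "best" estimate.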

Describing error

  • We should have some overall description of the accuracy of the model’s predictions

    • SSR

    • Mean squared error (MSE), where \(p\) is the number of estimated parameters

      \[ s^2 = \text{MSE} = \frac{1}{n-p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

    • Standard deviation

      \[ \text{SD} = \sqrt{\text{MSE}} \]
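A short sketch of these quantities for the empty model, using made-up scores. With one estimated parameter (p = 1), the MSE and SD should match R's built-in var() and sd().

```r
y <- c(2, 3, 3, 4, 5, 7, 18)   # hypothetical scores
n <- length(y)
p <- 1                          # one estimated parameter (b0)

y_hat    <- mean(y)             # prediction of the empty model
ssr      <- sum((y - y_hat)^2)  # sum of squared residuals
mse      <- ssr / (n - p)       # mean squared error
sd_error <- sqrt(mse)           # standard deviation

# For the empty model these match the built-in functions
c(mse = mse, var = var(y))
c(sd = sd_error, sd_builtin = sd(y))
```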

Statistical Modeling: An Example

  • Let’s look at general reasoning scores

Building a Model - Concrete example

\[ scores_i = b_0 + e_i \]

  • What is the overall mean of the dataset?

Building a Model - Concrete example

  • lm function in R can fit an empty model

  • \(b_0\) = Intercept = Estimate = Mean
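A minimal sketch of fitting the empty model with lm(). The scores are simulated as a stand-in for the general reasoning data, so the data frame dat and its values are assumptions, not the real dataset.

```r
set.seed(1)
# Simulated stand-in for the general reasoning scores (not the real data)
dat <- data.frame(scores = rnorm(200, mean = 100, sd = 15))

# Empty (intercept-only) model: scores ~ 1
fit <- lm(scores ~ 1, data = dat)

coef(fit)          # the intercept, b0
mean(dat$scores)   # the same value as the sample mean
```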

Building a Model - Concrete example

  • broom is a helper package that provides us with lots of useful functions to get things like residuals and predicted values

Building a model - Concrete example

  • Can get SS a few different ways
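A sketch of a few equivalent routes to the sum of squares, using base R only and simulated stand-in scores (the data are an assumption, not the real dataset).

```r
set.seed(1)
# Simulated stand-in for the general reasoning scores (not the real data)
dat <- data.frame(scores = rnorm(200, mean = 100, sd = 15))
fit <- lm(scores ~ 1, data = dat)

ss_resid    <- sum(residuals(fit)^2)                    # from the residuals
ss_deviance <- deviance(fit)                            # residual SS stored by lm
ss_by_hand  <- sum((dat$scores - mean(dat$scores))^2)   # straight from the formula
ss_from_var <- var(dat$scores) * (nrow(dat) - 1)        # variance times (n - 1)

c(ss_resid, ss_deviance, ss_by_hand, ss_from_var)
```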

Building a model - concrete example

  • Mean squared error (MSE)

Building a model - Concrete example

  • Predictions from the model
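As a sketch with simulated stand-in scores, the predictions of the empty model can be pulled out with fitted(). Every observation should receive the same predicted value, the mean.

```r
set.seed(1)
# Simulated stand-in for the general reasoning scores (not the real data)
dat <- data.frame(scores = rnorm(200, mean = 100, sd = 15))
fit <- lm(scores ~ 1, data = dat)

# Predicted value for each observation
pred <- fitted(fit)
head(pred)

# The empty model predicts the mean for everyone
range(pred)
mean(dat$scores)
```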

A More Complex Model

  • Do you think the empty model is a good model?


What Makes a Model “Good”

  1. We want it to describe our data well

  2. We want it to generalize to new datasets

    • We want error to be as small as possible

Taken from Poldrack (2023)

Can a Model Be Too Good?

  • Yes!

    • Overfitting

      • A model with little to no error will not generalize to new datasets
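A rough illustration of overfitting with simulated data: a straight line and an 8th-degree polynomial are fit to the same small training set, then scored on fresh data. The data-generating function and both models are assumptions made for this sketch.

```r
set.seed(42)
# Simulated data: a straight-line relationship plus noise (hypothetical)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 2 + 0.5 * x + rnorm(n, sd = 1))
}
train <- make_data(15)
test  <- make_data(15)

simple       <- lm(y ~ x, data = train)           # matches the true structure
complex_poly <- lm(y ~ poly(x, 8), data = train)  # far more flexible

# Mean squared error of a fitted model on a dataset
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Error on the data used for fitting vs. error on new data
c(simple = mse(simple, train), complex = mse(complex_poly, train))
c(simple = mse(simple, test),  complex = mse(complex_poly, test))
```

The flexible model always fits the training data at least as well as the line; on new data it typically does worse, which is the sense in which a model can be "too good."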

Normal distribution

  • Error in linear models is assumed to be normally distributed

\[ \epsilon \sim N(\mu, \sigma) \]

  • The normal distribution is also called the Gaussian distribution

  • If we can assume a variable is at least approximately normally distributed, we can make many inferences!

  • Most statistical models assume a normal distribution

Normal distribution

  • Normal(μ, σ)

    • Parameters:

      • \(\mu\) = Mean

        • Mean is the center of the distribution
      • \(\sigma\) = Standard deviation

        • Variance (\(\sigma^2\)) is the average squared deviation from the mean
        • \(\sigma = \sqrt{\sigma^2}\)
        • On average, how far is each point from the mean (spread)?

Building a Model - Normal Distribution

  • Properties of a normal distribution

    • Shape
      • Unimodal
      • Symmetric
      • Asymptotic

Building a Model - Normal Distribution

  • The PDF of a normal distribution is given by:

\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]\)
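As a quick check, the normal density can be written out by hand and compared with R's dnorm(). The function name normal_pdf and the example values (mean 100, SD 15) are chosen for illustration.

```r
# Normal density written out by hand: 1 / (sigma * sqrt(2 * pi)) * exp(...)
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(70, 85, 100, 115, 130)

by_hand  <- normal_pdf(x, mu = 100, sigma = 15)
built_in <- dnorm(x, mean = 100, sd = 15)

# The two columns should agree
cbind(x, by_hand, built_in)
```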

Normal Distribution

  • Skew

Normal Distribution

  • \(Y_1 \sim N(\mu, \sigma)\)

    • \(Y_1\) ∼ Normal(100, 15)

    • \(Y_2\) ∼ Normal(100, 15)

    • \(Y_n\) ∼ Normal(100, 15)

  • Or for all observations,

    • \(Y_i\) ∼ Normal(100, 15)

Normal Distribution

  • Everyone’s score comes from the same distribution

  • The average score should be around 100

  • Scores should be spread out with a standard deviation of about 15

  • Scores should follow a bell-shaped curve
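A small simulation sketch of what Normal(100, 15) implies. The sample size and seed are arbitrary choices.

```r
set.seed(123)
# Draw 10,000 scores from Normal(100, 15)
y <- rnorm(10000, mean = 100, sd = 15)

mean(y)   # should be close to 100
sd(y)     # should be close to 15

# hist(y) would show the bell-shaped curve
```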

Probability and Standard Normal Distribution: Z-Scores

\[Z(x) = \frac{x - \mu}{\sigma}\]

  • Z-score / standard score tells us how far away any data point is from the mean, in units of standard deviation
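For instance, assuming IQ-style scores with a mean of 100 and a standard deviation of 15, the Z-scores for a few example values work out as follows (the scores are chosen for illustration).

```r
mu    <- 100
sigma <- 15

iq <- c(85, 100, 115, 130)   # example scores
z  <- (iq - mu) / sigma
z   # -1  0  1  2
```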

Standard Normal Distribution

  • Properties of standard normal
    • Empirical Rule
      • 68.27% of the data falls within one standard deviation (sigma) of the mean
      • 95.45% falls within two sigma
      • 99.73% falls within three sigma
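These percentages can be recovered from the standard normal CDF as a quick check, using pnorm().

```r
k <- 1:3   # number of standard deviations from the mean

# Area between -k and +k under the standard normal curve
within_k <- pnorm(k) - pnorm(-k)

round(100 * within_k, 2)   # 68.27 95.45 99.73
```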

Z tables

  • NO MORE TABLES

Using R

  • dnorm(): Z-score to density (height) (PDF)
  • pnorm(): Z-score to area (CDF)
  • qnorm(): area to Z-score

Using R

  • pnorm function: Z-score to area (CDF)

\[ P(X \leq x) \]

If you have calculated a Z-score, pnorm(Z) gives the probability of a value less than it (lower.tail = TRUE, the default). Use lower.tail = FALSE for the probability of a value greater than it.
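A brief sketch of the two tails, using an arbitrary Z-score of 1.5 as the example value.

```r
z <- 1.5   # example Z-score

lower <- pnorm(z)                       # P(Z <= 1.5), area to the left
upper <- pnorm(z, lower.tail = FALSE)   # P(Z > 1.5), area to the right

c(lower = lower, upper = upper)
lower + upper   # the two areas add up to 1

# qnorm() reverses pnorm(): area back to Z-score
qnorm(lower)
```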

Using R: pnorm


  1. What is the z-score?

  2. What percentage is below this z-score?

  3. What percentage is above it?

Using R: pnorm

  • Calculate z-score

  • Percentage below IQ score of 55?
  • Percentage above IQ score of 55?
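One way to work this through, assuming IQ is distributed Normal(100, 15) as on the earlier slides:

```r
mu    <- 100
sigma <- 15

# Z-score for an IQ of 55
z <- (55 - mu) / sigma
z   # -3

# Proportion below and above that score
below <- pnorm(z)
above <- pnorm(z, lower.tail = FALSE)

round(100 * c(below = below, above = above), 2)   # as percentages
```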

Using R: pnorm

  • Percentage between IQ score of 120 and 159?
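The area between two scores is the difference of two CDF values. A sketch, again assuming IQ is Normal(100, 15):

```r
mu    <- 100
sigma <- 15

# P(120 < X < 159) = P(X < 159) - P(X < 120)
between <- pnorm(159, mean = mu, sd = sigma) - pnorm(120, mean = mu, sd = sigma)

round(100 * between, 2)   # as a percentage
```

Passing mean and sd to pnorm() lets it work on raw scores directly, so no separate Z-score step is needed.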

Package PnormGC

  • Percentage between IQ score of 120 and 159?

Package PnormGC

What about \(P(X \leq 69)\)


Using R: qnorm

  • qnorm(): area to z-scores

  • What is the score for which 5% lies above?
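A sketch of the qnorm() call. The slide does not name the distribution, so this assumes the IQ scale, Normal(100, 15), used on the earlier slides.

```r
# Score with 5% of the distribution above it (upper tail)
cutoff <- qnorm(0.05, mean = 100, sd = 15, lower.tail = FALSE)
cutoff

# Equivalent: the score with 95% of the distribution below it
qnorm(0.95, mean = 100, sd = 15)
```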


Practice pnorm

Suppose that BMI measures for men age 60 in a Heart Study population are normally distributed with a mean (μ) = 29 and standard deviation (σ) = 6. You are asked to compute the probability that a 60-year-old man in this population will have a BMI less than 30.

  • What is the z-score?

  • What is the probability a 60-year-old man in this population will have a BMI between 30 and 40?
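One possible way to work through the practice problem (variable names are illustrative):

```r
mu    <- 29
sigma <- 6

# Z-score for a BMI of 30
z_30 <- (30 - mu) / sigma
z_30

# P(BMI < 30)
p_below_30 <- pnorm(30, mean = mu, sd = sigma)
p_below_30

# P(30 < BMI < 40)
p_30_to_40 <- pnorm(40, mean = mu, sd = sigma) - pnorm(30, mean = mu, sd = sigma)
p_30_to_40
```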


Practice: qnorm

Suppose that SAT scores are normally distributed, and that the mean SAT score is 1000, and the standard deviation of all SAT scores is 100. How high must you score so that only 10% of the population scores higher than you?
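One possible way to work through the practice problem: "only 10% score higher" means the cutoff has 90% of the distribution below it.

```r
# Score with 10% of the population above it
cutoff <- qnorm(0.10, mean = 1000, sd = 100, lower.tail = FALSE)
cutoff   # about 1128

# Check: the proportion scoring above the cutoff is 0.10
pnorm(cutoff, mean = 1000, sd = 100, lower.tail = FALSE)
```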


Z-scores in practice

  • Standardization
    • Scaling your measures so they are comparable
    • Does not change the shape of the distribution or the ordering of scores, only the units
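A short sketch of standardizing a variable by hand and with base R's scale(), using made-up raw scores.

```r
x <- c(12, 15, 9, 20, 14)   # hypothetical raw scores

# Standardize by hand: subtract the mean, divide by the SD
z <- (x - mean(x)) / sd(x)

# scale() does the same thing (it returns a one-column matrix)
z_scale <- as.numeric(scale(x))

round(cbind(x, z, z_scale), 3)

# Z-scores have mean 0 and SD 1; the ordering of scores is unchanged
c(mean = mean(z), sd = sd(z))
all(order(x) == order(z))
```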

Standardized Scores

  • A standardized score is a Z-score that has been transformed to have a \(\mu\) and \(\sigma\) different from standard normal

    • IQ

      • \(\mu = 100\) \(\sigma = 15\)
    • SAT

      • \(\mu = 500\) \(\sigma = 100\)
    • T-score

      • \(\mu = 50\) \(\sigma = 10\)
  • New score = (new SD × Z) + new mean
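A sketch of this transformation applied to a few example Z-scores. The helper name rescale_z is made up for illustration.

```r
# Convert Z-scores to a scale with a chosen mean and SD
rescale_z <- function(z, new_mean, new_sd) {
  new_sd * z + new_mean
}

z <- c(-2, -1, 0, 1, 2)   # example Z-scores

iq      <- rescale_z(z, new_mean = 100, new_sd = 15)    # 70 85 100 115 130
sat     <- rescale_z(z, new_mean = 500, new_sd = 100)   # 300 400 500 600 700
t_score <- rescale_z(z, new_mean = 50,  new_sd = 10)    # 30 40 50 60 70

cbind(z, iq, sat, t_score)
```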