Probablity Distributions and Visualization

Princeton University

Jason Geller, Ph.D.(he/him)

2023-09-24

Outline

  • Probability - what is it?

  • Random processes and variables

  • Probability distributions

  • DGP

  • ggplot primer

  • Visualizing variance

Packages

library(grateful)
pkgs <- cite_packages(output = "table", out.dir = ".")
pkgs
           Package    Version
1             base      4.2.2
2       colorspace      2.1.0
3          cowplot      1.1.1
4  fivethirtyeight      0.6.2
5        gapminder      1.0.0
6           ggdist      3.2.0
7         gghalves      0.1.4
8           ggrain      0.0.3
9         ggridges      0.5.3
10     ggstatsplot      0.9.4
11       MetBrewer      0.2.0
12       openintro      2.4.0
13          pacman      0.5.1
14  palmerpenguins      0.1.1
15       patchwork      1.1.2
16       rmarkdown       2.14
17          scales      1.2.1
18       ThemePark 0.0.0.9000
19       tidyverse      1.3.2
                                               Citation
1                                                 @base
2  @colorspace2009b; @colorspace2009c; @colorspace2020a
3                                              @cowplot
4                                      @fivethirtyeight
5                                            @gapminder
6                                               @ggdist
7                                             @gghalves
8                                               @ggrain
9                                             @ggridges
10                                         @ggstatsplot
11                                           @MetBrewer
12                                           @openintro
13                                              @pacman
14                                      @palmerpenguins
15                                           @patchwork
16       @rmarkdown2018; @rmarkdown2020; @rmarkdown2022
17                                              @scales
18                                           @ThemePark
19                                           @tidyverse

Probability warm-up

  1. What is probability of drawing the ace of spades from a fair deck of cards?

  2. What is the probability of drawing an ace of any suit?

  3. You are going to roll some dice twice. What is the probability you roll doubles?

  4. What is the chance that a live specimen of the New Jersey Devil will be found?

  5. Who is more likely to be a victim of a street robbery, a young man or an old woman?

Frequentist interpretation of probability

  • Classical statistical inference is based off this interpretation
  • The frequentist approach treats probabilities as relative frequencies

    • If we say that the probability of rolling a 5 on a fair die is 1/6, we mean that, if the die were rolled an infinite number of times, 1/6 of all rolls would be a 5

What probability is not

What is probability theory?

  • Probability is the study of random processes

    • Probability is used to characterize uncertainty/randomness

Random processes: intuition

  • Let’s flip a fair coin
coinflips <- function(x) { 
  flip=rbinom(x, 1, 0.5) 
flip=ifelse(flip==1, "Tails", "Heads") 

return(flip) 
}

coinflips(1)
[1] "Tails"
  1. Can you tell me what the outcome will be?

  2. If we were to flip a fair coin many many times, would you be able to tell the proportion of times that we would obtain heads?

  • If answer to first question is “NO” AND Answer to second question is “YES”
    • You are dealing with a random process

Random processes

  • Random processes are mechanisms that produce outcomes… from a world/set of possible outcomes… with some degree of uncertainty but with regularity
  • Random assignment in experiments
  • Random dram of a sample of n individuals from a population of N individuals
  • Rolling a die

What is a random variable?

  • A variable that depends on a random process

    • Think about the following random phenomenon: “randomly selecting 2 students in this class room”

      • Sample space?

        • One possible outcome: \(\omega = \{\mathrm{Ari, \: Alex}\}\)

        • Another possible outcome: \(\omega = \{\mathrm{Kennedy, \: Brooke}\}\)

What is a random variable?

  • Can this be considered a random variable?
  • No. Random variables are always numeric

  • We operate on random variables using math

Illustration

  • Random variables are mappings from events to numbers

  • Formally, a random variable is defined as a function that maps the sample space \(\Omega\) of a random generative process into the real line (or into real numbers)

Probability distribution functions

  • Two types of random variables

    • Discrete

    • Continuous

  • The distribution of a random variable \(X\) describes the likelihood of the values that \(X\) can take

Discrete random variables: definition

  • Discrete random variables are defined on a range that is a countable set

  • i.e., they can only take on a finite or countably infinite number of different values

Probability mass function (PMF)

  • Let \(X\) be a discrete random variable

  • The probability mass function (PMF) of \(X\) summarizes the probability of each outcome \(x\)

  • PMF: function \(p\) given by \[P(X = x)\]

    • \(X\) = RV

    • \(x\) = Outcome

Example: Dessert tonight

Imagine that you started a strict diet a few days ago. You are at a dinner party at your friend’s who made your favorite dessert. You are very tempted and make the decision to flip a coin three times. The number of times that the flip coin returns tail determines the number of bites of the dessert you will have. Before you start flipping the coin, you want to learn more about your chances of not having dessert tonight. To do that, you decide to look at the possible outcomes. You let 𝑋 be the number of times a series of three coin flips returns tails (T). The support of 𝑋 is {0, 1, 2, 3}.

Dessert tonight

outcomes <- c('H', 'T')
sample_space <- expand.grid(outcomes, outcomes, outcomes)
sample_space
  Var1 Var2 Var3
1    H    H    H
2    T    H    H
3    H    T    H
4    T    T    H
5    H    H    T
6    T    H    T
7    H    T    T
8    T    T    T

PMF: Dessert tonight

Cumulative distribution function (CDF): definition

  • The CDF of a random variable is the function \(F\) such that

    • \(F(x) = P(X \leq x)\)
  • PMF tells us probability of each possible outcome

  • CDF tells us the probability that an outcome below a specific value occurs

  • Sums to 1

Continuous random variables: Definition

  • A continuous random variable is a variable that can take on an infinite number of values within a given range or interval

Probability Density Function (PDF): Definition

  • PDF is continuous version of PMF
  • The PDF of a random continuous variable is the function \(F\) such that

\[P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\]

  • PDF tells us the probability of range of outcomes
  • What is probability of observing IQ between 100 and 125?

CDF

  • \(F(x) = P(X \leq x)\)

    • \(IQ \leq 100\)

Summarizing random variables

  • PMFs, PDFs, CDFs are very useful tools to summarize information from rvs.

  • Many other ways to summarize random variables!

    • e.g., mean, median, standard deviation, etc.

Our goal as statisticians

Link probability distributions to the data generating process (DGP)

  • The DGP represents the “real-world” process of how data comes about

  • Probability distributions are mathematical models used to model and understand the DGP

DGP

  • Bottom-up: This approach begins with the observed data. By examining the data distribution, one might make educated guesses or inferences about the underlying processes that produced it

  • Top-down: This approach relies on pre-existing knowledge or theories about the system or phenomenon in question to inform our understanding of the DGP

Bootstrapping

    • Sampling with replacement

      • A computer based method for deriving the probability distribution for any random variable

      • How to do it

        • Run your analysis a bunch of times with a slightly different set of observations each time
    • Dice roll
    x=1:6
    
    sample(x, 6, replace=TRUE)
    [1] 6 3 1 2 1 3
    sample(x, 6, replace=TRUE)
    [1] 4 5 1 5 3 3
    sample(x, 6, replace=TRUE)
    [1] 5 4 3 2 5 1

Bootstrapping

Sources of Variance

  • There is uncertainty associated with the DGP

Randomness

  • Shuffling (permutation tests)

    Rind and Bordia (1996)

Visualizing variance

  • ggplot2 is tidyverse’s data visualization package (plotnine in Python uses similar syntax)

  • The gg in ggplot2 stands for Grammar of Graphics

    • It is inspired by the book Grammar of Graphics by Leland Wilkinson

    • A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

ggplot2

ggplot2

Let’s start with a blank canvas

ggplot()

ggplot2 - Data

library(palmerpenguins)
ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm))+
  theme_minimal(base_size = 16)

ggplot2 - Layers

  • Let’s add a geom

    • geom_point adds a dot for each raw data point
ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
          geom_point()+
  theme_minimal(base_size = 16)

ggplot2 - Layers

  • Let’s add another geom

    • geom_smooth plots a smoothed line for the data
    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
      geom_smooth()+
  theme_minimal(base_size = 16)

ggplot2 Layers

  • Maybe a linear line
    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

ggplot2 - Layers

  • It might be nice to see the raw data WITH the line. We can combine geoms!
    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
  geom_point() + 
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

ggplot2 - Size

  • Points are a bit small. Let’s make them bigger!
    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
  geom_point(size = 2, alpha= .2) + 
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

ggplot2 - Color

  • How could we add information about different types of penguins?
    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
  geom_point(size = 4)+
  theme_minimal(base_size = 16)

ggplot2- Axes

  • Let’s clean up our plot

    • Add clearer axis labels
    title <- ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
  geom_point(size = 4) + 
  xlab("Bill Length in milimters") + 
  ylab("Flipper Length in milimeters")+
  theme_minimal(base_size = 16)

ggplot2 - Title

  • Let’s clean up our plot

  • Add title

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
              geom_point(size = 4) + 
              xlab("Bill Length in milimters") + 
              ylab("Flipper Length in milimeters") + 
             ggtitle("Palmer Penguins: Bill length vs Flipper length") +
      theme_minimal(base_size = 16)

ggplot2 - Themes

theme_plot <-  ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
              geom_point(size = 4) + 
              xlab("Bill Length in milimters") + 
              ylab("Flipper Length in milimeters") + 
              ggtitle("Palmer Penguins: Bill length vs Flipper length") + 
              theme_dark(base_size = 16)
theme_plot

ggplot2 - Color themes

library(MetBrewer)
#| code-line-numbers: "6"
#| fig-align: center

ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
              geom_point(size = 4) + 
              xlab("Bill Length in milimters") + 
              ylab("Flipper Length in milimeters") + 
              ggtitle("Palmer Penguins: Bill length vs Flipper length") + # Changes legend title, and selects a colour-palette
  scale_colour_manual(
                    values = MetBrewer::met.brewer("VanGogh2",3)) + 
              theme_minimal(base_size = 16)

ggplot2 - Themes

# install.packages("remotes")
#remotes::install_github("MatthewBJane/ThemePark")
library("ThemePark")

`X variable` <- rnorm(50, 0, 1)
`Y variable` <- rnorm(50, 0, 1)


ggplot(data = data.frame(x = `X variable`, y = `Y variable`), aes(x = x, y = y)) +
  geom_smooth(method = 'lm', color = barbie_theme_colors["medium"]) +
  geom_point() +
  labs(title = 'Barbie Scatter Plot') +
  theme_barbie()

Visualizing Variance and Relationships

Disclaimer

- More information is always better!

  • Avoid visualizing single numbers when you have a whole distribution of numbers

Histograms

Histograms

  • Put data into equally spaced buckets (or bins), plot how many rows are in each bucket

    library(gapminder)
    gapminder_2002 <- gapminder %>% 
          filter(year == 2002)
    
    ggplot(gapminder_2002,
               aes(x = lifeExp)) +
          geom_histogram()+
          theme_minimal(base_size = 16)

ggdist

  • Put data into equally spaced buckets (or bins), plot how many rows are in each bucket

    library(ggdist)
            gapminder_2002 <- gapminder %>% 
              filter(year == 2002)
    
            fig=ggplot(gapminder_2002,
                   aes(x = lifeExp)) +
              geom_dots()+
          theme_minimal(base_size = 16)

Histograms: Bin width

  • Range of values in each bar or dot

    ggplot(gapminder_2002, aes(x = lifeExp)) +
       geom_histogram(binwidth = .2) +
      theme_minimal(base_size = 16)

    ggplot(gapminder_2002, aes(x = lifeExp)) +
       geom_histogram(binwidth = 20) +
      theme_minimal(base_size = 16)

     ggplot(gapminder_2002, aes(x = lifeExp)) +
          geom_histogram(binwidth = 2) +
      theme_minimal(base_size = 16)

Histogram tips

  • Add a border to the bars
    for readability

    • geom_histogram(..., color = "green")
ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = 2, color="green") +
  theme_minimal(base_size = 16)

Histogram tips

  • Set the boundary
ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = 2, color="white", boundary=50) +
  theme_minimal(base_size = 16)

Density plots

  • Use calculus to find the probability of each x value

    #|code-line-numbers: "2"
    
    
    ggplot(gapminder_2002, aes(x = lifeExp)) +
      geom_density(fill = "grey60", color = "grey30")+
      theme_minimal(base_size = 16)

Density plots: Kernels and bandwidths

  • Different options for calculus change the plot shape

    • Kernels - Smooth data points

    • bandwidth - how wide

ggplot(gapminder_2002, aes(x = lifeExp)) +
  geom_density(fill = "grey60", color = "grey30")+
  theme_minimal(base_size = 16)

Box plots

Box plots

  • Show specific distributional numbers

    box <- ggplot(gapminder_2002,
           aes(x = lifeExp)) +
      geom_boxplot()+
      theme_minimal(base_size = 16)

Five number summary

Categorical Variables

Bar plots

library(fivethirtyeight)
# Using ggplot2 to plot the data
candy_rankings %>%
  # Sorting the data to have the highest percentages at the top
  arrange(-winpercent) %>%
  # Taking the top n candies (optional, you can remove the following line if you want all)
  top_n(20, wt = winpercent)
# A tibble: 20 × 13
   competitorname              chocolate fruity caramel peanutyalmondy nougat
   <chr>                       <lgl>     <lgl>  <lgl>   <lgl>          <lgl> 
 1 Reese's Peanut Butter cup   TRUE      FALSE  FALSE   TRUE           FALSE 
 2 Reese's Miniatures          TRUE      FALSE  FALSE   TRUE           FALSE 
 3 Twix                        TRUE      FALSE  TRUE    FALSE          FALSE 
 4 Kit Kat                     TRUE      FALSE  FALSE   FALSE          FALSE 
 5 Snickers                    TRUE      FALSE  TRUE    TRUE           TRUE  
 6 Reese's pieces              TRUE      FALSE  FALSE   TRUE           FALSE 
 7 Milky Way                   TRUE      FALSE  TRUE    FALSE          TRUE  
 8 Reese's stuffed with pieces TRUE      FALSE  FALSE   TRUE           FALSE 
 9 Peanut butter M&M's         TRUE      FALSE  FALSE   TRUE           FALSE 
10 Nestle Butterfinger         TRUE      FALSE  FALSE   TRUE           FALSE 
11 Peanut M&Ms                 TRUE      FALSE  FALSE   TRUE           FALSE 
12 3 Musketeers                TRUE      FALSE  FALSE   FALSE          TRUE  
13 Starburst                   FALSE     TRUE   FALSE   FALSE          FALSE 
14 100 Grand                   TRUE      FALSE  TRUE    FALSE          FALSE 
15 M&M's                       TRUE      FALSE  FALSE   FALSE          FALSE 
16 Nestle Crunch               TRUE      FALSE  FALSE   FALSE          FALSE 
17 Rolo                        TRUE      FALSE  TRUE    FALSE          FALSE 
18 Milky Way Simply Caramel    TRUE      FALSE  TRUE    FALSE          FALSE 
19 Skittles original           FALSE     TRUE   FALSE   FALSE          FALSE 
20 Hershey's Krackel           TRUE      FALSE  FALSE   FALSE          FALSE 
# ℹ 7 more variables: crispedricewafer <lgl>, hard <lgl>, bar <lgl>,
#   pluribus <lgl>, sugarpercent <dbl>, pricepercent <dbl>, winpercent <dbl>

Bar plots

Exploring quantitative variables

  • Scatter plots

Categorical x Continuous

Exploring multiple groups

  • Visualize the distribution of a
    single variable across groups

  • Add a fill aesthetic or use faceting!

Multiple histograms

  • This looks bad and is hard to read
ggplot(gapminder_2002,
       aes(x = lifeExp,
           fill = continent)) +
  geom_histogram(binwidth = 5, 
                 color = "white", 
                 boundary = 50) +
  guides(fill = "none") +
  theme_minimal(base_size = 16)

Multiple histograms

 ggplot(gapminder_2002,
       aes(x = lifeExp,
           fill = continent)) +
  geom_histogram(binwidth = 5, 
                 color = "white", 
                 boundary = 50) +
  guides(fill = "none") +
  facet_wrap(vars(continent))+
  theme_minimal(base_size = 16)

Pyramid histograms

gapminder_intervals <- gapminder %>% 
  filter(year == 2002) %>% 
  mutate(africa = 
           ifelse(continent == "Africa", 
                  "Africa", 
                  "Not Africa")) %>% 
  mutate(age_buckets = 
           cut(lifeExp, 
               breaks = seq(30, 90, by = 5))) %>% 
  group_by(africa, age_buckets) %>% 
  summarize(total = n())
ggplot(gapminder_intervals, 
       aes(y = age_buckets,
           x = ifelse(africa == "Africa", 
                      total, -total),
           fill = africa)) +
  geom_col(width = 1, color = "white")+
  theme_minimal(base_size = 16)

Multiple densities: Transparency

p4 <- ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(x = lifeExp,
           fill = continent)) +
  geom_density(alpha = 0.2)+
  theme_minimal(base_size = 16)

Multiple densities: Ridge plots

library(ggridges)
ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(x = lifeExp,
           fill = continent,
           y = continent)) +
  geom_density_ridges()+
  theme_minimal(base_size = 16)

Multiple Box plots

  • Boxplots
url <- "https://raw.githubusercontent.com/z3tt/DataViz-Teaching/master/data/weissgerber-data.csv"
data <- read_csv(url)
ggplot(data, aes(x = group, y = value, color = group, fill = group)) +
  scale_y_continuous(breaks = 1:9) +
  geom_boxplot(alpha = .5, size = 1.5, outlier.size = 5) + 
  theme_minimal(base_size = 16) + 
  theme(legend.position =  "none")

Violin plots

  • Density plot rotated 90 degrees and mirrored

    ggplot(data, aes(x = group, y = value)) +
      # create the violin plot
           geom_violin(
            aes(fill = group, fill = after_scale(colorspace::lighten(fill, .5))),
            size = 1.2, bw = .8
          ) +
      theme_minimal(base_size = 16)+
      # delete the legend information
            theme(legend.position =  "none")

Are violin plots bad?

Half violin plots + Box

# Create the plot
ggplot(data, aes(x = group, y = value, fill=group)) +
  # Half violin plot with ggdist
  stat_halfeye()+
  # Distributional boxplot (similar to standard boxplot but with more flexibility)
  geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white")+
  theme_minimal(base_size = 16) + 
    theme(legend.position = "none")

Strip plots

ggplot(data, aes(x = group, y = value, fill=group)) +
  # get the jitter points
   geom_jitter(position=position_jitter(0.2)) + 
  #removed legend
  theme_minimal(base_size = 16) + 
  theme(legend.position = "none")

Strip plots

  • Add summary stats

    ggplot(data, aes(x = group, y = value, fill=group)) +
       geom_jitter(position=position_jitter(0.2)) + 
       stat_summary(fun.y=median, geom="point", shape=18,
                     size=3, color="red")+
      #removed legend
      theme_minimal(base_size = 16) + 
        theme(legend.position = "none")

Raincloud plots

Half violin plots + box + raw points

library(ggrain)
ggplot(data, aes(x = group, y = value, fill=group)) +
  # raincloud plots
  geom_rain() + 
  #removed legend
  theme_minimal(base_size = 16) + 
  theme(legend.position = "none")

Raincloud Plots

ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(y = lifeExp,
           x = continent,
           color = continent)) + 
  geom_rain() + 
  coord_flip()+
  theme_minimal(base_size = 16)

Multiple Geoms

  • Multiple Geoms
library(gghalves)
p6 <- ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(y = lifeExp,
           x = continent,
           color = continent)) +
  geom_half_boxplot(side = "l") +
  geom_half_point(side = "r") + theme(legend.position  = "none")+
  theme_minimal(base_size = 16)

Categorical vs. Categorical

  • Stacked bar plots

Combining plots

patchwork

library(patchwork)

p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point()

p2 <- ggplot(penguins, aes(x=bill_depth_mm,y=body_mass_g )) + 
  geom_point()

p1 + p2

cowplot

library(cowplot)

p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point()

p2 <- ggplot(penguins, aes(x=bill_depth_mm,y=body_mass_g )) + 
  geom_point()

plot_grid(p1, p2, labels = c('A', 'B'))

Cited

Rind, Bruce, and Prashant Bordia. 1996. “Effect on Restaurant Tipping of Male and Female Servers Drawing a Happy, Smiling Face on the Backs of Customers’ Checks.” Journal of Applied Social Psychology 26 (3): 218–25. https://doi.org/10.1111/j.1559-1816.1996.tb01847.x.