Probablity Distributions and Visualization

Princeton University

Jason Geller, Ph.D.(he/him)

2023-09-24

Outline

Probability - what is it?
Random processes and variables
Probability distributions
DGP
ggplot primer
Visualizing variance

Packages

library(grateful)
pkgs <- cite_packages(output = "table", out.dir = ".")
pkgs

           Package    Version
1             base      4.2.2
2       colorspace      2.1.0
3          cowplot      1.1.1
4  fivethirtyeight      0.6.2
5        gapminder      1.0.0
6           ggdist      3.2.0
7         gghalves      0.1.4
8           ggrain      0.0.3
9         ggridges      0.5.3
10     ggstatsplot      0.9.4
11       MetBrewer      0.2.0
12       openintro      2.4.0
13          pacman      0.5.1
14  palmerpenguins      0.1.1
15       patchwork      1.1.2
16       rmarkdown       2.14
17          scales      1.2.1
18       ThemePark 0.0.0.9000
19       tidyverse      1.3.2
                                               Citation
1                                                 @base
2  @colorspace2009b; @colorspace2009c; @colorspace2020a
3                                              @cowplot
4                                      @fivethirtyeight
5                                            @gapminder
6                                               @ggdist
7                                             @gghalves
8                                               @ggrain
9                                             @ggridges
10                                         @ggstatsplot
11                                           @MetBrewer
12                                           @openintro
13                                              @pacman
14                                      @palmerpenguins
15                                           @patchwork
16       @rmarkdown2018; @rmarkdown2020; @rmarkdown2022
17                                              @scales
18                                           @ThemePark
19                                           @tidyverse

Probability warm-up

What is probability of drawing the ace of spades from a fair deck of cards?
What is the probability of drawing an ace of any suit?
You are going to roll some dice twice. What is the probability you roll doubles?
What is the chance that a live specimen of the New Jersey Devil will be found?
Who is more likely to be a victim of a street robbery, a young man or an old woman?

Frequentist interpretation of probability

Classical statistical inference is based off this interpretation

The frequentist approach treats probabilities as relative frequencies
- If we say that the probability of rolling a 5 on a fair die is 1/6, we mean that, if the die were rolled an infinite number of times, 1/6 of all rolls would be a 5

What probability is not

What is probability theory?

Probability is the study of random processes
- Probability is used to characterize uncertainty/randomness

Random processes: intuition

Let’s flip a fair coin

coinflips <- function(x) { 
  flip=rbinom(x, 1, 0.5) 
flip=ifelse(flip==1, "Tails", "Heads") 

return(flip) 
}

coinflips(1)

[1] "Tails"

Can you tell me what the outcome will be?
If we were to flip a fair coin many many times, would you be able to tell the proportion of times that we would obtain heads?

If answer to first question is “NO” AND Answer to second question is “YES”
- You are dealing with a random process

Random processes

Random processes are mechanisms that produce outcomes… from a world/set of possible outcomes… with some degree of uncertainty but with regularity

Random assignment in experiments

Random dram of a sample of n individuals from a population of N individuals

Rolling a die

What is a random variable?

A variable that depends on a random process
- Think about the following random phenomenon: “randomly selecting 2 students in this class room”
  - Sample space?
    - One possible outcome: \(\omega = \{\mathrm{Ari, \: Alex}\}\)
    - Another possible outcome: \(\omega = \{\mathrm{Kennedy, \: Brooke}\}\)

What is a random variable?

Can this be considered a random variable?

No. Random variables are always numeric
We operate on random variables using math

Illustration

Random variables are mappings from events to numbers
Formally, a random variable is defined as a function that maps the sample space \(\Omega\) of a random generative process into the real line (or into real numbers)

Probability distribution functions

Two types of random variables
- Discrete
- Continuous
The distribution of a random variable \(X\) describes the likelihood of the values that \(X\) can take

Discrete random variables: definition

Discrete random variables are defined on a range that is a countable set
i.e., they can only take on a finite or countably infinite number of different values

Probability mass function (PMF)

Let \(X\) be a discrete random variable
The probability mass function (PMF) of \(X\) summarizes the probability of each outcome \(x\)
PMF: function \(p\) given by \[P(X = x)\]
- \(X\) = RV
- \(x\) = Outcome

Example: Dessert tonight

Imagine that you started a strict diet a few days ago. You are at a dinner party at your friend’s who made your favorite dessert. You are very tempted and make the decision to flip a coin three times. The number of times that the flip coin returns tail determines the number of bites of the dessert you will have. Before you start flipping the coin, you want to learn more about your chances of not having dessert tonight. To do that, you decide to look at the possible outcomes. You let 𝑋 be the number of times a series of three coin flips returns tails (T). The support of 𝑋 is {0, 1, 2, 3}.

Dessert tonight

outcomes <- c('H', 'T')
sample_space <- expand.grid(outcomes, outcomes, outcomes)
sample_space

  Var1 Var2 Var3
1    H    H    H
2    T    H    H
3    H    T    H
4    T    T    H
5    H    H    T
6    T    H    T
7    H    T    T
8    T    T    T

PMF: Dessert tonight

Cumulative distribution function (CDF): definition

The CDF of a random variable is the function \(F\) such that
- \(F(x) = P(X \leq x)\)
PMF tells us probability of each possible outcome
CDF tells us the probability that an outcome below a specific value occurs
Sums to 1

Continuous random variables: Definition

A continuous random variable is a variable that can take on an infinite number of values within a given range or interval

Probability Density Function (PDF): Definition

PDF is continuous version of PMF
The PDF of a random continuous variable is the function \(F\) such that

\[P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\]

PDF tells us the probability of range of outcomes

What is probability of observing IQ between 100 and 125?

CDF

\(F(x) = P(X \leq x)\)
- \(IQ \leq 100\)

Summarizing random variables

PMFs, PDFs, CDFs are very useful tools to summarize information from rvs.
Many other ways to summarize random variables!
- e.g., mean, median, standard deviation, etc.

Our goal as statisticians

Link probability distributions to the data generating process (DGP)

The DGP represents the “real-world” process of how data comes about
Probability distributions are mathematical models used to model and understand the DGP

DGP

Bottom-up: This approach begins with the observed data. By examining the data distribution, one might make educated guesses or inferences about the underlying processes that produced it
Top-down: This approach relies on pre-existing knowledge or theories about the system or phenomenon in question to inform our understanding of the DGP

Bootstrapping

- Sampling with replacement
  - A computer based method for deriving the probability distribution for any random variable
  - How to do it
    
    Run your analysis a bunch of times with a slightly different set of observations each time
- Dice roll
x=1:6 sample(x, 6, replace=TRUE)
[1] 6 3 1 2 1 3
sample(x, 6, replace=TRUE)
[1] 4 5 1 5 3 3
sample(x, 6, replace=TRUE)
[1] 5 4 3 2 5 1

Bootstrapping

Sources of Variance

There is uncertainty associated with the DGP

Types of Error
Unexplained Error
Sources of Unexplained Error

Randomness

Shuffling (permutation tests)

Rind and Bordia (1996)

Visualizing variance

ggplot2 is tidyverse’s data visualization package (plotnine in Python uses similar syntax)
The gg in ggplot2 stands for Grammar of Graphics
- It is inspired by the book Grammar of Graphics by Leland Wilkinson
- A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

`ggplot2`

Let’s start with a blank canvas

ggplot()

`ggplot2` - Data

library(palmerpenguins)
ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm))+
  theme_minimal(base_size = 16)

`ggplot2` - Layers

Let’s add a geom
- geom_point adds a dot for each raw data point

ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
          geom_point()+
  theme_minimal(base_size = 16)

`ggplot2` - Layers

Let’s add another geom
- geom_smooth plots a smoothed line for the data

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
      geom_smooth()+
  theme_minimal(base_size = 16)

`ggplot2` Layers

Maybe a linear line

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

`ggplot2` - Layers

It might be nice to see the raw data WITH the line. We can combine geoms!

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
  geom_point() + 
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

`ggplot2` - Size

Points are a bit small. Let’s make them bigger!

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm)) +
  geom_point(size = 2, alpha= .2) + 
      geom_smooth(method="lm")+
  theme_minimal(base_size = 16)

`ggplot2` - Color

How could we add information about different types of penguins?

    ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
  geom_point(size = 4)+
  theme_minimal(base_size = 16)

`ggplot2`- Axes

Let’s clean up our plot
- Add clearer axis labels

    title <- ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
  geom_point(size = 4) + 
  xlab("Bill Length in milimters") + 
  ylab("Flipper Length in milimeters")+
  theme_minimal(base_size = 16)

`ggplot2` - Title

Let’s clean up our plot

Add title

ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
          geom_point(size = 4) + 
          xlab("Bill Length in milimters") + 
          ylab("Flipper Length in milimeters") + 
         ggtitle("Palmer Penguins: Bill length vs Flipper length") +
  theme_minimal(base_size = 16)

`ggplot2` - Themes

theme_plot <-  ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
              geom_point(size = 4) + 
              xlab("Bill Length in milimters") + 
              ylab("Flipper Length in milimeters") + 
              ggtitle("Palmer Penguins: Bill length vs Flipper length") + 
              theme_dark(base_size = 16)
theme_plot

`ggplot2` - Color themes

library(MetBrewer)
#| code-line-numbers: "6"
#| fig-align: center

ggplot(data=penguins, mapping = aes(x=bill_length_mm, y = flipper_length_mm, color=species)) +
              geom_point(size = 4) + 
              xlab("Bill Length in milimters") + 
              ylab("Flipper Length in milimeters") + 
              ggtitle("Palmer Penguins: Bill length vs Flipper length") + # Changes legend title, and selects a colour-palette
  scale_colour_manual(
                    values = MetBrewer::met.brewer("VanGogh2",3)) + 
              theme_minimal(base_size = 16)

`ggplot2` - Themes

# install.packages("remotes")
#remotes::install_github("MatthewBJane/ThemePark")

library("ThemePark")

`X variable` <- rnorm(50, 0, 1)
`Y variable` <- rnorm(50, 0, 1)


ggplot(data = data.frame(x = `X variable`, y = `Y variable`), aes(x = x, y = y)) +
  geom_smooth(method = 'lm', color = barbie_theme_colors["medium"]) +
  geom_point() +
  labs(title = 'Barbie Scatter Plot') +
  theme_barbie()

Visualizing Variance and Relationships

Disclaimer

- More information is always better!

Avoid visualizing single numbers when you have a whole distribution of numbers

Histograms

Put data into equally spaced buckets (or bins), plot how many rows are in each bucket

library(gapminder)
gapminder_2002 <- gapminder %>% 
      filter(year == 2002)

ggplot(gapminder_2002,
           aes(x = lifeExp)) +
      geom_histogram()+
      theme_minimal(base_size = 16)

`ggdist`

Put data into equally spaced buckets (or bins), plot how many rows are in each bucket

library(ggdist)
        gapminder_2002 <- gapminder %>% 
          filter(year == 2002)

        fig=ggplot(gapminder_2002,
               aes(x = lifeExp)) +
          geom_dots()+
      theme_minimal(base_size = 16)

Histograms: Bin width

Range of values in each bar or dot

Narrow
Big
Perfect

ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = .2) +
  theme_minimal(base_size = 16)

ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = 20) +
  theme_minimal(base_size = 16)

 ggplot(gapminder_2002, aes(x = lifeExp)) +
      geom_histogram(binwidth = 2) +
  theme_minimal(base_size = 16)

Histogram tips

Add a border to the bars
for readability
- geom_histogram(..., color = "green")

ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = 2, color="green") +
  theme_minimal(base_size = 16)

Histogram tips

Set the boundary

ggplot(gapminder_2002, aes(x = lifeExp)) +
   geom_histogram(binwidth = 2, color="white", boundary=50) +
  theme_minimal(base_size = 16)

Density plots

Use calculus to find the probability of each x value

#|code-line-numbers: "2"


ggplot(gapminder_2002, aes(x = lifeExp)) +
  geom_density(fill = "grey60", color = "grey30")+
  theme_minimal(base_size = 16)

Density plots: Kernels and bandwidths

Different options for calculus change the plot shape
- Kernels - Smooth data points
- bandwidth - how wide

ggplot(gapminder_2002, aes(x = lifeExp)) +
  geom_density(fill = "grey60", color = "grey30")+
  theme_minimal(base_size = 16)

Box plots

Show specific distributional numbers

box <- ggplot(gapminder_2002,
       aes(x = lifeExp)) +
  geom_boxplot()+
  theme_minimal(base_size = 16)

Five number summary

Categorical Variables

Bar plots

library(fivethirtyeight)
# Using ggplot2 to plot the data
candy_rankings %>%
  # Sorting the data to have the highest percentages at the top
  arrange(-winpercent) %>%
  # Taking the top n candies (optional, you can remove the following line if you want all)
  top_n(20, wt = winpercent)

# A tibble: 20 × 13
   competitorname              chocolate fruity caramel peanutyalmondy nougat
   <chr>                       <lgl>     <lgl>  <lgl>   <lgl>          <lgl> 
 1 Reese's Peanut Butter cup   TRUE      FALSE  FALSE   TRUE           FALSE 
 2 Reese's Miniatures          TRUE      FALSE  FALSE   TRUE           FALSE 
 3 Twix                        TRUE      FALSE  TRUE    FALSE          FALSE 
 4 Kit Kat                     TRUE      FALSE  FALSE   FALSE          FALSE 
 5 Snickers                    TRUE      FALSE  TRUE    TRUE           TRUE  
 6 Reese's pieces              TRUE      FALSE  FALSE   TRUE           FALSE 
 7 Milky Way                   TRUE      FALSE  TRUE    FALSE          TRUE  
 8 Reese's stuffed with pieces TRUE      FALSE  FALSE   TRUE           FALSE 
 9 Peanut butter M&M's         TRUE      FALSE  FALSE   TRUE           FALSE 
10 Nestle Butterfinger         TRUE      FALSE  FALSE   TRUE           FALSE 
11 Peanut M&Ms                 TRUE      FALSE  FALSE   TRUE           FALSE 
12 3 Musketeers                TRUE      FALSE  FALSE   FALSE          TRUE  
13 Starburst                   FALSE     TRUE   FALSE   FALSE          FALSE 
14 100 Grand                   TRUE      FALSE  TRUE    FALSE          FALSE 
15 M&M's                       TRUE      FALSE  FALSE   FALSE          FALSE 
16 Nestle Crunch               TRUE      FALSE  FALSE   FALSE          FALSE 
17 Rolo                        TRUE      FALSE  TRUE    FALSE          FALSE 
18 Milky Way Simply Caramel    TRUE      FALSE  TRUE    FALSE          FALSE 
19 Skittles original           FALSE     TRUE   FALSE   FALSE          FALSE 
20 Hershey's Krackel           TRUE      FALSE  FALSE   FALSE          FALSE 
# ℹ 7 more variables: crispedricewafer <lgl>, hard <lgl>, bar <lgl>,
#   pluribus <lgl>, sugarpercent <dbl>, pricepercent <dbl>, winpercent <dbl>

Bar plots

Exploring quantitative variables

Scatter plots

Categorical x Continuous

Exploring multiple groups

Visualize the distribution of a
single variable across groups
Add a fill aesthetic or use faceting!

Multiple histograms

This looks bad and is hard to read

ggplot(gapminder_2002,
       aes(x = lifeExp,
           fill = continent)) +
  geom_histogram(binwidth = 5, 
                 color = "white", 
                 boundary = 50) +
  guides(fill = "none") +
  theme_minimal(base_size = 16)

Multiple histograms

 ggplot(gapminder_2002,
       aes(x = lifeExp,
           fill = continent)) +
  geom_histogram(binwidth = 5, 
                 color = "white", 
                 boundary = 50) +
  guides(fill = "none") +
  facet_wrap(vars(continent))+
  theme_minimal(base_size = 16)

Pyramid histograms

gapminder_intervals <- gapminder %>% 
  filter(year == 2002) %>% 
  mutate(africa = 
           ifelse(continent == "Africa", 
                  "Africa", 
                  "Not Africa")) %>% 
  mutate(age_buckets = 
           cut(lifeExp, 
               breaks = seq(30, 90, by = 5))) %>% 
  group_by(africa, age_buckets) %>% 
  summarize(total = n())

ggplot(gapminder_intervals, 
       aes(y = age_buckets,
           x = ifelse(africa == "Africa", 
                      total, -total),
           fill = africa)) +
  geom_col(width = 1, color = "white")+
  theme_minimal(base_size = 16)

Multiple densities: Transparency

p4 <- ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(x = lifeExp,
           fill = continent)) +
  geom_density(alpha = 0.2)+
  theme_minimal(base_size = 16)

Multiple densities: Ridge plots

library(ggridges)
ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(x = lifeExp,
           fill = continent,
           y = continent)) +
  geom_density_ridges()+
  theme_minimal(base_size = 16)

Multiple Box plots

Boxplots

url <- "https://raw.githubusercontent.com/z3tt/DataViz-Teaching/master/data/weissgerber-data.csv"
data <- read_csv(url)

ggplot(data, aes(x = group, y = value, color = group, fill = group)) +
  scale_y_continuous(breaks = 1:9) +
  geom_boxplot(alpha = .5, size = 1.5, outlier.size = 5) + 
  theme_minimal(base_size = 16) + 
  theme(legend.position =  "none")

Violin plots

Density plot rotated 90 degrees and mirrored

ggplot(data, aes(x = group, y = value)) +
  # create the violin plot
       geom_violin(
        aes(fill = group, fill = after_scale(colorspace::lighten(fill, .5))),
        size = 1.2, bw = .8
      ) +
  theme_minimal(base_size = 16)+
  # delete the legend information
        theme(legend.position =  "none")

Are violin plots bad?

Half violin plots + Box

# Create the plot
ggplot(data, aes(x = group, y = value, fill=group)) +
  # Half violin plot with ggdist
  stat_halfeye()+
  # Distributional boxplot (similar to standard boxplot but with more flexibility)
  geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white")+
  theme_minimal(base_size = 16) + 
    theme(legend.position = "none")

Strip plots

ggplot(data, aes(x = group, y = value, fill=group)) +
  # get the jitter points
   geom_jitter(position=position_jitter(0.2)) + 
  #removed legend
  theme_minimal(base_size = 16) + 
  theme(legend.position = "none")

Strip plots

Add summary stats

ggplot(data, aes(x = group, y = value, fill=group)) +
   geom_jitter(position=position_jitter(0.2)) + 
   stat_summary(fun.y=median, geom="point", shape=18,
                 size=3, color="red")+
  #removed legend
  theme_minimal(base_size = 16) + 
    theme(legend.position = "none")

Raincloud plots

Half violin plots + box + raw points

library(ggrain)
ggplot(data, aes(x = group, y = value, fill=group)) +
  # raincloud plots
  geom_rain() + 
  #removed legend
  theme_minimal(base_size = 16) + 
  theme(legend.position = "none")

Raincloud Plots

ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(y = lifeExp,
           x = continent,
           color = continent)) + 
  geom_rain() + 
  coord_flip()+
  theme_minimal(base_size = 16)

Multiple Geoms

Multiple Geoms

library(gghalves)
p6 <- ggplot(filter(gapminder_2002, 
              continent != "Oceania"),
       aes(y = lifeExp,
           x = continent,
           color = continent)) +
  geom_half_boxplot(side = "l") +
  geom_half_point(side = "r") + theme(legend.position  = "none")+
  theme_minimal(base_size = 16)

Categorical vs. Categorical

Stacked bar plots

Combining plots

`patchwork`

library(patchwork)

p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point()

p2 <- ggplot(penguins, aes(x=bill_depth_mm,y=body_mass_g )) + 
  geom_point()

p1 + p2

`cowplot`

library(cowplot)

p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point()

p2 <- ggplot(penguins, aes(x=bill_depth_mm,y=body_mass_g )) + 
  geom_point()

plot_grid(p1, p2, labels = c('A', 'B'))

Cited

Rind, Bruce, and Prashant Bordia. 1996. “Effect on Restaurant Tipping of Male and Female Servers Drawing a Happy, Smiling Face on the Backs of Customers’ Checks.” Journal of Applied Social Psychology 26 (3): 218–25. https://doi.org/10.1111/j.1559-1816.1996.tb01847.x.

Probablity Distributions and Visualization

Outline

Packages

Probability warm-up

Frequentist interpretation of probability

What probability is not

What is probability theory?

Random processes: intuition

Random processes

What is a random variable?

What is a random variable?

Illustration

Probability distribution functions

Discrete random variables: definition

Probability mass function (PMF)

Example: Dessert tonight

Dessert tonight

PMF: Dessert tonight

Cumulative distribution function (CDF): definition

Continuous random variables: Definition

Probability Density Function (PDF): Definition

CDF

Summarizing random variables

Our goal as statisticians

DGP

Bootstrapping

Bootstrapping

Sources of Variance

Randomness

Visualizing variance

ggplot2

ggplot2

ggplot2 - Data

ggplot2 - Layers

ggplot2 - Layers

ggplot2 Layers

ggplot2 - Layers

ggplot2 - Size

ggplot2 - Color

ggplot2- Axes

ggplot2 - Title

ggplot2 - Themes

ggplot2 - Color themes

ggplot2 - Themes

Visualizing Variance and Relationships

Disclaimer

Histograms

Histograms

ggdist

Histograms: Bin width

Histogram tips

Histogram tips

Density plots

Density plots: Kernels and bandwidths

Box plots

Box plots

Five number summary

Categorical Variables

Bar plots

Bar plots

Exploring quantitative variables

Categorical x Continuous

Exploring multiple groups

Multiple histograms

Multiple histograms

Pyramid histograms

Multiple densities: Transparency

Multiple densities: Ridge plots

Multiple Box plots

Violin plots

Are violin plots bad?

Half violin plots + Box

Strip plots

Strip plots

Raincloud plots

Raincloud Plots

Multiple Geoms

Categorical vs. Categorical

Combining plots

patchwork

`ggplot2`

`ggplot2`

`ggplot2` - Data

`ggplot2` - Layers

`ggplot2` - Layers

`ggplot2` Layers

`ggplot2` - Layers

`ggplot2` - Size

`ggplot2` - Color

`ggplot2`- Axes

`ggplot2` - Title

`ggplot2` - Themes

`ggplot2` - Color themes

`ggplot2` - Themes

`ggdist`

`patchwork`

`cowplot`