Andrew Heiss's blog

How to use a histogram as a legend in {ggplot2}

Andrew Heiss — Wed, 19 Feb 2025 05:00:00 GMT

On Bluesky the other day, I came across this neat post that suggested using a histogram as a plot legend to provide additional context for the data being shown:

Joey Cherdarchuk’s original post

Here’s a closer comparison of those two maps:

Joey Cherdarchuk’s maps side-by-side

This histogram legend is especially useful for choropleth maps where units like counties are sized differently, which can create an illusion of a different distribution. For instance, in that original post, larger dark blue areas stand out a lot visually—like in Alaska, New Mexico, Arizona, and Central California—and make it seem like unemployment is fairly high.

But looking at the histogram, that’s not actually the case. Most counties have an unemployment rate around 3–6%. This illusion is happening because land isn’t unemployed; people are.

I thought this was a cool approach, so I figured I’d try to replicate it with R. In the original post, the map was created with D3, the bar chart legend was created with Excel, and the two were combined with Figma. That process is a little too manual for me, but with the magic of R, {ggplot2}, and {patchwork}, we can create the same map completely programmatically.

Let’s do it!

Clean and join data

First, let’s load some packages and tweak some theme settings:

library(tidyverse)
library(readxl)
library(sf)
library(tigris)
library(patchwork)

# Add some font settings to theme_void()
theme_fancy_map <- function() {
  theme_void(base_family = "IBM Plex Sans") +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.13, size = rel(1.4)),
      plot.subtitle = element_text(hjust = 0.13, size = rel(1.1)),
      plot.caption = element_text(hjust = 0.13, size = rel(0.8), color = "grey50"),
    )
}

BLS unemployment data

Next, we can get 2016 unemployment data from the Bureau of Labor Statistics. BLS offers county-level data on annual average labor force participation here, both as plain text and Excel files. The plain text data is structured a little goofily (it’s not comma-separated; it’s a fixed width format where column headings span multiple lines), but the Excel version is in nice columns and is easier to work with. Though even then, we need to skip the first few rows, and the last few rows, and specify column names ourselves.

Download this first from the BLS:

Labor force data by county, 2016 annual averages (XLS)

For the sake of mapping, we’ll truncate the unemployment rate at 9% and mark any counties with higher than 9% unemployment with 9.1 and modify the legend to show “>9%”:

# Load BLS data and clean it up
bls_2016 <- read_excel(
  "laucnty16.xlsx",
  skip = 5,
  col_names = c(
    "laus_code", "STATEFP", "COUNTYFP", "county_name_state",
    "year", "nothing", "labor_force", "employed", "unemployed", "unemp"
  )
) |> 
  # The last few rows in the Excel file aren't actually data, but extra notes,
  # so drop those rows here since they don't have a state FIPS code
  drop_na(STATEFP) |> 
  mutate(
    # Truncate the unemployment rate at 9
    unemp_truncated = ifelse(unemp > 9, 9.1, unemp),
    # Find difference from Fed target of 4%
    unemp_diff = unemp_truncated - 4
  )

bls_2016
## # A tibble: 3,219 × 12
##    laus_code       STATEFP COUNTYFP county_name_state   year  nothing labor_force employed unemployed unemp unemp_truncated unemp_diff
##                                                                           
##  1 CN0100100000000 01      001      Autauga County, AL  2016  NA            25710    24395       1315   5.1             5.1        1.1
##  2 CN0100300000000 01      003      Baldwin County, AL  2016  NA            89778    84972       4806   5.4             5.4        1.4
##  3 CN0100500000000 01      005      Barbour County, AL  2016  NA             8334     7638        696   8.4             8.4        4.4
##  4 CN0100700000000 01      007      Bibb County, AL     2016  NA             8539     7986        553   6.5             6.5        2.5
##  5 CN0100900000000 01      009      Blount County, AL   2016  NA            24380    23061       1319   5.4             5.4        1.4
##  6 CN0101100000000 01      011      Bullock County, AL  2016  NA             4785     4457        328   6.9             6.9        2.9
##  7 CN0101300000000 01      013      Butler County, AL   2016  NA             9116     8484        632   6.9             6.9        2.9
##  8 CN0101500000000 01      015      Calhoun County, AL  2016  NA            45450    42470       2980   6.6             6.6        2.6
##  9 CN0101700000000 01      017      Chambers County, AL 2016  NA            14858    14044        814   5.5             5.5        1.5
## 10 CN0101900000000 01      019      Cherokee County, AL 2016  NA            11241    10671        570   5.1             5.1        1.1
## # ℹ 3,209 more rows
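
As a quick sanity check on the truncation rule, here's a small sketch with made-up rates (not values from the BLS file) showing where the 9.1 placeholder ends up once the rates are split into one-point bins. It uses base R's cut() as a rough stand-in for the binning that the map's fill scale does later, so treat it as an illustration rather than exactly what ggplot does internally:

```r
# Hypothetical unemployment rates, just for illustration
rates <- c(2.8, 5.1, 9.0, 12.4, 23.5)

# Same truncation rule as above: anything over 9 becomes 9.1
truncated <- ifelse(rates > 9, 9.1, rates)

# Cut into one-point-wide bins from 1 to 10
bins <- cut(truncated, breaks = 1:10)

data.frame(rates, truncated, bins)
```

Both 12.4 and 23.5 land in the (9,10] bin, which is the bin that gets relabeled as “>9%” in the legend.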

Census geographic data

Next we’ll get geographic data from the US Census with {tigris}.

Backup data source

At the time of this writing, {tigris} is working. It wasn’t working a couple weeks ago as the wildly illegal Department of Government Efficiency rampaged through different federal agencies—including the US Census—and shut down the Census’s GIS APIs. But it seems to be working for now?

If it’s not working, IPUMS’s NHGIS project offers the same shapefiles.

The BLS data and the Census data each have columns with state and county FIPS codes which we can use to join the two datasets:

# Get county and state shapefiles from Tigris
us_counties <- counties(year = 2016, cb = TRUE) |> 
  filter(as.numeric(STATEFP) <= 56) |> 
  shift_geometry()  # Move AK and HI

us_states <- states(year = 2016, cb = TRUE) |> 
  filter(as.numeric(STATEFP) <= 56) |> 
  shift_geometry()  # Move AK and HI

# Join BLS data to the map
counties_with_unemp <- us_counties |>
  left_join(bls_2016, by = join_by(STATEFP, COUNTYFP))

# Check out the joined data
counties_with_unemp |> 
  select(STATEFP, COUNTYFP, county_name_state, unemp_truncated, geometry)
## Simple feature collection with 3142 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -3112000 ymin: -1698000 xmax: 2258000 ymax: 1566000
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
## First 10 features:
##    STATEFP COUNTYFP    county_name_state unemp_truncated                       geometry
## 1       19      107    Keokuk County, IA             4.3 MULTIPOLYGON (((297173 4548...
## 2       19      189 Winnebago County, IA             3.4 MULTIPOLYGON (((163347 6734...
## 3       20      093    Kearny County, KS             3.1 MULTIPOLYGON (((-482328 605...
## 4       20      123  Mitchell County, KS             3.3 MULTIPOLYGON (((-212918 197...
## 5       20      187   Stanton County, KS             2.8 MULTIPOLYGON (((-528445 214...
## 6       21      005  Anderson County, KY             4.0 MULTIPOLYGON (((940067 1094...
## 7       21      029   Bullitt County, KY             4.1 MULTIPOLYGON (((873753 1022...
## 8       21      049     Clark County, KY             4.7 MULTIPOLYGON (((1012432 106...
## 9       21      059   Daviess County, KY             4.4 MULTIPOLYGON (((749702 5517...
## 10      21      063   Elliott County, KY             9.1 MULTIPOLYGON (((1102886 138...
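
Since left_join() keeps every county whether or not it has a match, it can be worth checking for counties that ended up without unemployment data. Here's a minimal sketch with toy data; the toy_* objects and their values are made up for illustration and only mimic the structure of the real datasets:

```r
library(dplyr)

# Toy stand-ins for the real data; the names and values are made up
toy_counties <- tibble(
  STATEFP = c("01", "01", "15"),
  COUNTYFP = c("001", "003", "005")
)

toy_bls <- tibble(
  STATEFP = c("01", "01"),
  COUNTYFP = c("001", "003"),
  unemp = c(5.1, 5.4)
)

# left_join() keeps every county, even ones without a BLS match
toy_joined <- toy_counties |>
  left_join(toy_bls, by = join_by(STATEFP, COUNTYFP))

# Counties with no matching unemployment data
toy_joined |> filter(is.na(unemp))
```

Note that the FIPS codes are stored as text, so "01" and "1" would not match; both datasets need to keep the leading zeros for the join to work.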

The map works!

ggplot() +
  geom_sf(data = us_states, fill = "#0074D9", color = "white", linewidth = 0.25) +
  # Albers projection
  coord_sf(crs = st_crs("ESRI:102003"))

Map adjustments

We need to make a couple little adjustments to the map first. In the original image on Bluesky, there’s extra space on the right side of the map to allow for the legend. We can change the plot window by adding 10% of the width of the map to the right.

Technically we don’t have to work with percents here; the data is currently using the Albers projection, which works in meters, so we could add something like 500,000 meters / 500 km to the right. But this is a more general solution and also works if the map data is in decimal degrees instead of meters.

Also, the far western Aleutian islands mess with the visual balance of the map (and they don’t appear because they’re so small), so we’ll also subtract 10% of the map from the left.

# Get x-axis limits of the bounding box for the state data
xlim_current <- st_bbox(us_states)$xlim

# Add 540ish km (or 10% of the US) to the bounds (thus shifting the window over)
xlim_expanded <- c(
  xlim_current[1] + (0.1 * diff(xlim_current)), 
  xlim_current[2] + (0.1 * diff(xlim_current))
)

ggplot() +
  geom_sf(data = us_states, fill = "#0074D9", color = "white", linewidth = 0.25) +
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded)
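
To see the arithmetic behind that shift, here's a small base R sketch. shift_limits() is a made-up helper (it isn't part of {sf} or {ggplot2}), and the limits are the rounded x values from the bounding box printed earlier for the joined county data, so the exact numbers for us_states might differ slightly:

```r
# Shift a pair of axis limits to the right by a fraction of their width
# (shift_limits() is a hypothetical helper, not from any package)
shift_limits <- function(limits, fraction = 0.1) {
  limits + fraction * diff(limits)
}

# Rounded x-axis limits from the bounding box printed earlier, in meters
xlim_albers <- c(-3112000, 2258000)

# 10% of the width, in km
0.1 * diff(xlim_albers) / 1000

# Both ends of the window move right by that amount
shift_limits(xlim_albers)

# The same helper works for limits in decimal degrees
shift_limits(c(-125, -67))
```

The width here is 5,370 km, so 10% is 537 km, which is where the “540ish km” in the comment above comes from.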

Extract interior state borders

Because we’re using color = "white", linewidth = 0.25, every state gets a thin white border. This causes some issues though. All the states that share borders actually get a thicker border, since a state’s western border joins up with its neighbor’s eastern border. Also, all the coastlines and islands get borders, which diminishes the landmass—especially on a white background.

Like, look at Alaska’s Aleutian Islands, or Hawai’i’s smaller islands, or Michigan’s Les Cheneaux Islands and Isle Royale, or California’s Channel Islands, or the Florida Keys, or North Carolina’s Outer Banks—they all basically disappear.

To fix this, we can use st_intersection() to identify the intersections of all the state shapes (see this and this for more details).

Now all the islands and coastlines have much better definition and the borders between states are truly sized at 0.25:

interior_state_borders <- st_intersection(us_states) |>
  filter(n.overlaps > 1) |> 
  # Remove weird points that st_intersection() adds
  filter(!(st_geometry_type(geometry) %in% c("POINT", "MULTIPOINT")))

ggplot() +
  geom_sf(data = us_states, fill = "#0074D9", linewidth = 0) +
  geom_sf(data = interior_state_borders, linewidth = 0.25, color = "white") +
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded)
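
To get a feel for what st_intersection() returns when it's given a single set of shapes, here's a toy sketch with two made-up squares that share an edge, standing in for two neighboring states. The square() helper and the toy_* objects are hypothetical and only exist for this illustration:

```r
library(sf)
library(dplyr)

# Build a unit square starting at a given x position (hypothetical helper)
square <- function(x_offset) {
  st_polygon(list(rbind(
    c(x_offset, 0), c(x_offset + 1, 0), c(x_offset + 1, 1),
    c(x_offset, 1), c(x_offset, 0)
  )))
}

# Two made-up "states" that touch along the line x = 1
toy_states <- st_sf(
  name = c("West", "East"),
  geometry = st_sfc(square(0), square(1))
)

# Self-intersection adds n.overlaps and origins columns
toy_pieces <- st_intersection(toy_states)

# Keep only the pieces that belong to more than one shape
toy_borders <- toy_pieces |>
  filter(n.overlaps > 1)

toy_borders
```

The shared edge should come back as a line with n.overlaps = 2, while the squares themselves have n.overlaps = 1 and get filtered out. Coastlines work the same way: they only belong to one shape, so they never make it into the borders layer.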

Map with horizontal gradient step legend

Now that we have cleaned and adjusted geographic and unemployment data, we can make a fancy map! Instead of building this sequentially, I’ve included all the code all at once, with lots of comments at each step.

A few things to note:

scale_fill_stepsn() lets you use distinct bins of color instead of a continuous gradient
We position the legend inside the plot with theme(legend.position = "inside", legend.position.inside = c(0.86, 0.32)). Those 0.86, 0.32 coordinates took a lot of tinkering to get! The units for legend.position.inside are based on percentages of the plot, so the legend appears where x is 86% across and 32% up. The position changes every time the plot dimensions change. To make life easier as I played with different values, I used {ggview} to specify and lock in exact dimensions of the plot:
```
library(ggview)

p <- ggplot(...) +
  geom_sf(...)

p + canvas(7, 5)
```
I’m not using ggview::canvas() here in the post because I’m specifying figure dimensions with Quarto chunk options instead (fig-width: 7 and fig-height: 5).

Here’s the map!

ggplot() +
  # Add counties filled with unemployment levels
  geom_sf(
    data = counties_with_unemp, aes(fill = unemp_truncated), linewidth = 0
  ) +
  # Add interior state boundaries
  geom_sf(
    data = interior_state_borders, color = "white", linewidth = 0.25
  ) +
  # Show the unemployment legend as steps instead of a standard gradient
  scale_fill_stepsn(
    colours = scales::brewer_pal(palette = "YlGnBu")(9),
    breaks = 1:10,
    limits = c(1, 10),
    # Change the label for >9%
    labels = case_match(
      1:10,
      1 ~ "1%",
      10 ~ ">9%",
      .default = as.character(1:10)
    )
  ) +
  # Yay labels
  labs(
    title = "US unemployment rates",
    subtitle = "2016 annual averages by county",
    caption = "Source: US Bureau of Labor Statistics",
    fill = "Unemployment rate"
  ) +
  # Use Albers projection and new x-axis limits
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded) +
  # Theme adjustments
  theme_fancy_map() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.86, 0.32),
    legend.direction = "horizontal",
    legend.text = element_text(size = rel(0.55)),
    legend.title = element_text(hjust = 0.5, face = "bold", size = rel(0.7), margin = margin(t = 3)),
    legend.title.position = "bottom",
    legend.key.width = unit(1.55, "lines"),
    legend.key.height = unit(0.7, "lines")
  )

Map with histogram legend

We can replace the step gradient legend with a histogram that is filled using the same colors as the step legend.

The easiest method that gives us the most control over the legend histogram is to create a separate plot object for the histogram and place it inside the map with {patchwork}’s inset_element().

Here’s the histogram, again with comments at each step. Only one neat trick to note here:

geom_histogram automatically determines the bin width for the variable assigned to the x aesthetic. In order to fill each bar by bin-specific color, we need to access information about those newly created bins. We can do this with after_stat()—here we fill each bar using the already-calculated x bin categories with fill = after_stat(factor(x))

hist_legend <- ggplot(bls_2016, aes(x = unemp_truncated)) +
  # Fill each histogram bar using the x axis category that ggplot creates
  geom_histogram(
    aes(fill = after_stat(factor(x))), 
    binwidth = 1, boundary = 0, color = "white"
  ) +
  # Fill with the same palette as the map
  scale_fill_brewer(palette = "YlGnBu", guide = "none") +
  # Modify the x-axis labels to use >9%
  scale_x_continuous(
    breaks = 2:10, 
    labels = case_match(
      2:10,
      2 ~ "2%",
      10 ~ ">9%",
      .default = as.character(2:10)
    )
  ) +
  # Just one label to replicate the legend title
  labs(x = "Unemployment rate") +
  # Theme adjustments
  theme_fancy_map() +
  theme(
    axis.text.x = element_text(size = rel(0.55)),
    axis.title.x = element_text(size = rel(0.68), margin = margin(t = 3, b = 3), face = "bold")
  )
hist_legend
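
To peek at the bin information that after_stat() exposes, we can use layer_data() from {ggplot2} on a tiny histogram. This is a sketch with made-up values, not the BLS data, but it uses the same binwidth and boundary settings:

```r
library(ggplot2)

# Made-up values, just for illustration
toy <- data.frame(value = c(1.2, 1.7, 2.5, 2.6, 2.9, 3.4))

toy_hist <- ggplot(toy, aes(x = value)) +
  geom_histogram(
    aes(fill = after_stat(factor(x))),
    binwidth = 1, boundary = 0
  )

# Pull out the values ggplot computed for each bar
toy_bins <- layer_data(toy_hist)
toy_bins[, c("x", "xmin", "xmax", "count")]
```

The x column holds the midpoint of each bin (1.5, 2.5, and 3.5 here), and that's the value that factor(x) turns into one discrete fill category per bar.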

Next, we’ll place that hist_legend plot inside a map with inset_element(). Like legend.position.inside = c(0.86, 0.32) in the previous map, the left = 0.75, bottom = 0.26, right = 0.98, top = 0.45 values here are percentages of the plot area and they’re fully dependent on the overall dimensions of the plot. Getting these exact numbers took a lot of manual adjusting, and ggview::canvas() was once again indispensable for keeping the plot dimensions constant.

unemp_map <- ggplot() +
  # Add counties filled with unemployment levels
  geom_sf(
    data = counties_with_unemp, aes(fill = unemp_truncated), color = NA, linewidth = 0
  ) +
  # Add interior state boundaries
  geom_sf(
    data = interior_state_borders, color = "white", linewidth = 0.25, fill = NA
  ) +
  # Show the unemployment legend as steps instead of a standard gradient, but
  # don't actually show the legend
  scale_fill_stepsn(
    colours = scales::brewer_pal(palette = "YlGnBu")(9),
    breaks = 1:10, 
    guide = "none"
  ) +
  # Yay labels
  labs(
    title = "US unemployment rates",
    subtitle = "2016 annual averages by county",
    caption = "Source: US Bureau of Labor Statistics"
  ) +
  # Use Albers projection and new x-axis limits
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded) +
  # Theme stuff
  theme_fancy_map()

# Add the histogram to the map
combined_map_hist <- unemp_map + 
  inset_element(hist_legend, left = 0.75, bottom = 0.26, right = 0.98, top = 0.45)
combined_map_hist

Map with automatic histogram legend with {legendry}

Finally, the new {legendry} package makes it so we can create a custom histogram-based legend without needing to use {patchwork} with a separate histogram plot!

It doesn’t provide as much control over the resulting histogram. The gizmo_histogram() function uses base R’s hist() behind the scenes, so we have to specify bin widths and other settings in hist.arg as base R arguments, like breaks = 10 instead of ggplot’s bins = 10.

Not all of hist()’s options seem to work here. For instance, I get a warning if I use border = "white" to add a white border around each bar (argument ‘border’ is not made use of), since that border option is disabled when using base R’s hist() with plot = FALSE:

hist(counties_with_unemp$unemp_truncated, breaks = 10, border = "white", plot = FALSE)
## Warning in hist.default(counties_with_unemp$unemp_truncated, breaks = 10, : argument 'border' is not made use of
## $breaks
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $counts
## [1]  12 244 589 821 644 410 205  97 119
## 
## $density
## [1] 0.00382 0.07768 0.18752 0.26138 0.20503 0.13053 0.06527 0.03088 0.03789
## 
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
## 
## $xname
## [1] "counties_with_unemp$unemp_truncated"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
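
One thing to keep in mind with base R's hist(): a single number for breaks is only a suggestion for the number of bins, while a vector of break points is used exactly as given. Here's a small sketch with made-up values. Whether gizmo_histogram() handles a vector of breaks in hist.arg the same way is an assumption that isn't tested here:

```r
# Made-up values between 1 and 10
x <- c(1.5, 2.2, 3.8, 4.1, 4.4, 5.9, 9.1, 9.1)

# A single number is only a suggestion for the number of bins
suggested <- hist(x, breaks = 10, plot = FALSE)

# A vector of break points is used exactly as given
exact <- hist(x, breaks = seq(1, 10, by = 1), plot = FALSE)

suggested$breaks
exact$breaks
exact$counts
```

With the explicit vector, the bins are guaranteed to line up with whole percentage points, regardless of what the data looks like.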

Also, it’s currently filling each histogram bar with the full gradient, not the 9 distinct steps, and I can’t figure out how to define custom colors for each bar—and it might not even be possible since color settings aren’t picked up anyway because of {legendry}’s use of plot = FALSE 🤷‍♂️.

But despite these downsides, this automatic histogram legend with {legendry} is really neat!

library(legendry)

# Create a custom histogram guide
histogram_guide <- compose_sandwich(
  middle = gizmo_histogram(just = 0, hist.arg = list(breaks = 10)),
  text = "axis_base"
)

ggplot() +
  # Add counties filled with unemployment levels
  geom_sf(
    data = counties_with_unemp, aes(fill = unemp_truncated), color = NA, linewidth = 0
  ) +
  # Add interior state boundaries
  geom_sf(
    data = interior_state_borders, color = "white", linewidth = 0.25, fill = NA
  ) +
  # Show the unemployment legend with a custom histogram guide
  scale_fill_stepsn(
    colours = scales::brewer_pal(palette = "YlGnBu")(9),
    breaks = 1:10,
    limits = c(1, 10),
    guide = histogram_guide,
    # Change the label for >9%
    labels = case_match(
      1:10,
      1 ~ "1%",
      10 ~ ">9%",
      .default = as.character(1:10)
    )
  ) +
  # Yay labels
  labs(
    title = "US unemployment rates",
    subtitle = "2016 annual averages by county",
    caption = "Source: US Bureau of Labor Statistics",
    fill = "Unemployment rate"
  ) +
  # Use Albers projection and new x-axis limits
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded) +
  # Theme stuff
  theme_fancy_map() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.86, 0.32),
    legend.direction = "horizontal",
    legend.text = element_text(size = rel(0.55)),
    legend.title = element_text(hjust = 0.5, face = "bold", size = rel(0.7), margin = margin(t = 3)),
    legend.title.position = "bottom"
  )

Bonus! Use points instead of choropleths

We’re still using choropleth maps here, which still isn’t ideal for showing the idea that “land isn’t unemployed”. One solution is to plot points that are sized by population (here, each county’s labor force). This is pretty straightforward with {sf}: we need to convert the county polygons into single points, which we can do with st_point_on_surface(). Then, after a bunch of tinkering with legend options, we’ll have this gorgeous map:

# Convert the county polygons into single points
counties_with_unemp_points <- counties_with_unemp |> 
  st_point_on_surface()

unemp_map_points <- ggplot() +
  # Use a gray background
  geom_sf(data = us_states, fill = "gray90", linewidth = 0) +
  geom_sf(data = interior_state_borders, linewidth = 0.25, color = "white") +
  # Include semi-transparent points with shape 21 (so there's a border)
  geom_sf(
    data = counties_with_unemp_points, 
    aes(size = labor_force, fill = unemp_truncated), 
    pch = 21, color = "white", stroke = 0.25, alpha = 0.8
  ) +
  # Control the size of the points in the legend
  scale_size_continuous(
    range = c(1, 9), labels = scales::label_comma(), 
    breaks = c(10000, 100000, 1000000),
    # Make the points black and not have a border
    guide = guide_legend(override.aes = list(pch = 19, color = "black"))
  ) +
  # Show the unemployment legend as steps instead of a standard gradient, but
  # don't actually show the legend
  scale_fill_stepsn(
    colours = scales::brewer_pal(palette = "YlGnBu")(9),
    breaks = 1:10, 
    guide = "none"
  ) +
  # Labels
  labs(
    title = "US unemployment rates",
    subtitle = "2016 annual averages by county",
    caption = "Source: US Bureau of Labor Statistics",
    fill = "Unemployment rate",
    size = "Labor force"
  ) +
  # Albers
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded) +
  # Theme stuff
  theme_fancy_map() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.837, 0.13),
    legend.text = element_text(size = rel(0.55)),
    legend.title = element_text(hjust = 0.5, face = "bold", size = rel(0.7), margin = margin(t = 3)),
    legend.title.position = "bottom"
  )

# Add the histogram to the map
combined_map_hist_points <- unemp_map_points + 
  inset_element(hist_legend, left = 0.75, bottom = 0.26, right = 0.98, top = 0.45)
combined_map_hist_points
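
As an aside, {sf} also has st_centroid() for turning polygons into points, which isn't used in this post. One difference is that a centroid can fall outside an oddly shaped polygon, while st_point_on_surface() always returns a point on the shape itself. Here's a sketch with a made-up U-shaped polygon to illustrate:

```r
library(sf)

# A made-up U-shaped polygon whose centroid falls in the empty notch
u_shape <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(3, 0), c(3, 3), c(2, 3), c(2, 1),
  c(1, 1), c(1, 3), c(0, 3), c(0, 0)
))))

pt_centroid <- st_centroid(u_shape)
pt_surface <- st_point_on_surface(u_shape)

# Does each point actually touch the polygon?
st_intersects(pt_centroid, u_shape, sparse = FALSE)
st_intersects(pt_surface, u_shape, sparse = FALSE)
```

The centroid lands in the gap between the two arms of the U, so it doesn't intersect the polygon, while the point on the surface does. For counties with islands or crescent shapes, that keeps each point inside the county it represents.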

Bonus #2! Use a diverging color scheme + nested legend circles

But wait, there’s more! Based on discussions with really smart dataviz people on Bluesky in the wake of me posting about this blog post there, we can make two additional tweaks:

While the different sizes for the points are neat, I’m not a fan of how big the vertical spacing is between the 10,000; 100,000; and 1,000,000. Unfortunately there’s no easy way to change it. Technically we can use legend.key.spacing.y in theme() to adjust it, but that doesn’t work as expected here because each of those legend entries is sized to match the largest point (i.e., the point for 1,000,000 is the biggest, so the legend entries for all the other values match its height, even if they don’t need all that space).

To fix this, we can use guide_circles() from {legendry} to show the different point sizes as concentric circles, which is more compact (and just looks neat).
Instead of showing a range of low → high values, we can color these counties based on a meaningful midpoint to help highlight which counties are doing great (low unemployment! good!) and which aren’t (high unemployment! bad!). That might not always necessarily be the best approach—showing the full range of actual values like in the original map is a way of just describing the range and doesn’t inherently imply good or bad. But in other plots where data might be more actionable, divergences from some central value would be much more helpful.

In the United States, the Federal Reserve has a unique dual mandate to use macroeconomic policies to target both inflation and unemployment (most other countries’ central banks only target inflation). The Fed typically aims for an inflation rate of 2% and an unemployment rate of 4ish%. So in this new map, we’ll center each county’s unemployment rate around 4% and show the percentage point deviations from that Fed target. Counties colored in darker red have higher unemployment rates than the target; counties colored in blue have lower rates than the target.

We can then imagine that we’re a policymaker interested in unemployment trends—we can look at the map and quickly identify areas that are doing poorly and doing well.

Up at the beginning of the document where we loaded and cleaned the bls_2016 dataset, I’ve added a new variable that centers the unemployment rate at 4:

mutate(unemp_diff = unemp_truncated - 4)
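
Here's a small sketch of what that centering does to a few of the truncated rates shown in the tables above (5.1, 8.4, and 4.0), plus the 9.1 placeholder for counties over 9%. It uses base R's cut() with one-point-wide bins centered on whole numbers, as a rough stand-in for the bins used in the histogram legend:

```r
# A few truncated rates from the tables above, plus the 9.1 placeholder
unemp_truncated <- c(5.1, 8.4, 4.0, 9.1)
unemp_diff <- unemp_truncated - 4

# Bins one point wide, centered on whole numbers (so 0 covers -0.5 to 0.5)
diff_bins <- cut(unemp_diff, breaks = seq(-2.5, 5.5, by = 1))

data.frame(unemp_truncated, unemp_diff, diff_bins)
```

A county at exactly 4% ends up in the middle (-0.5,0.5] bin, and the truncated 9.1 placeholder ends up in the top (4.5,5.5] bin, which stands for anything more than 4 percentage points above the target.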

We can then use this to create a new histogram and new map colored with the “vik” palette from the {scico} package, which has lots of neat diverging palettes. We’ll also create a fancy circle-based legend with {legendry}. Here’s the fully annotated code and final map:

library(ggtext)  # For Markdown-based text in ggplot
library(scico)   # For perceptually uniform colors

# Make new histogram legend
hist_legend_diffs <- ggplot(bls_2016, aes(x = unemp_diff)) +
  # Fill each histogram bar using the x axis category that ggplot creates
  # Use boundary = 0.5 to shift the bin ranges from things like 1-2 to 1.5-2.5
  geom_histogram(
    aes(fill = after_stat(x)), 
    binwidth = 1, boundary = 0.5, color = "white"
  ) +
  # Fill with the same palette as the map
  # scale_fill_brewer(palette = "YlGnBu", guide = "none") +
  scale_fill_scico(palette = "vik", midpoint = 0, guide = "none") +
  # Modify the x-axis labels to show percentage point values and format them with
  # markdown to get original unemployment values on separate lines
  scale_x_continuous(
    breaks = -2:5, 
    labels = case_match(
      -2:5,
      -2 ~ "**−2 pp.**<br>(2%)",
      0 ~ "**0**<br>(4% ±<br>0.5 pp.)",
      5 ~ "**>+4 pp.**<br>(>9%)",
      .default = glue::glue(
        "**{x}**", 
        x = scales::label_comma(
          style_positive = "plus", style_negative = "minus"
        )(-2:5))
    )
  ) +
  # Just one label to replicate the legend title
  labs(x = "Difference from Fed target") +
  # Theme adjustments
  theme_fancy_map() +
  theme(
    axis.text.x = element_markdown(size = rel(0.5), vjust = 1, lineheight = 1.3),
    axis.title.x = element_text(size = rel(0.68), margin = margin(t = 3, b = 3), face = "bold")
  )

unemp_map_points_diffs <- ggplot() +
  # Use a lighter gray background
  geom_sf(data = us_states, fill = "gray95", linewidth = 0) +
  # Use slightly darker state borders
  geom_sf(data = interior_state_borders, linewidth = 0.25, color = "grey60") +
  # Include semi-transparent points with shape 21 (so there's a border)
  geom_sf(
    data = counties_with_unemp_points, 
    aes(size = labor_force, fill = unemp_diff), 
    shape = 21, color = "white", stroke = 0.25, alpha = 0.8
  ) +
  # Control the size of the points in the legend
  scale_size_continuous(
    range = c(1, 11), labels = scales::label_comma(), 
    breaks = c(100000, 1000000, 5000000),
    # Show the legend circles in semi-transparent dark gray
    guide = guide_circles(
      text_position = "right",
      override.aes = list(
        fill = "grey30", alpha = 0.8
      )
    )
  ) +
  # This is tricky! We want to use the diverging vik palette but have it 
  # centered at 0. With scale_fill_scico(), there's a midpoint argument, like we 
  # used for the histogram. For generating regular lists of colors with scico(), 
  # though, there's no midpoint argument. Instead, we need to make a few 
  # specific adjustments: 
  #
  # 1. Generate 11 possible colors, since there are 5 colors above the 0 
  #    midpoint in the histogram and we need 5 parallel negative colors below 0 
  #    (even though we're only using 2)
  # 2. Set the limits of the legend to the symmetrical -5 to 5 range so that 
  #    it's centered at 0
  # 3. Set the breaks to go asymmetrically from -2:5. But actually set them 
  #    from -2.5 to 4.5 since that matches the shifted histogram, which uses a 
  #    boundary of 0.5 instead of 0 (so the histogram bins cover ranges like 
  #    0.5 to 1.5 instead of 0 to 1)
  scale_fill_stepsn(
    colours = scico::scico(11, palette = "vik"),
    limits = c(-5, 5),
    breaks = seq(-2.5, 4.5, by = 1),
    guide = "none"
  ) +
  # Labels
  labs(
    title = "US unemployment (2016)",
    subtitle = "Differences from the Federal Reserve's 4% target",
    caption = "Source: US Bureau of Labor Statistics",
    fill = "Unemployment rate",
    size = "County labor force"
  ) +
  # Albers
  coord_sf(crs = st_crs("ESRI:102003"), xlim = xlim_expanded) +
  # Theme adjustments
  theme_fancy_map() +
  theme(
    # {legendry} complains if there's no legend.margin setting; using 
    # theme_void() removes that setting and breaks the plot, so we specify 
    # some 0 values here
    legend.margin = margin(0, 0, 0, 0, "pt"),
    legendry.legend.key.margin = margin(0, 5, 0, 0, "pt"),
    legend.ticks = element_line(colour = "black", linetype = "22"),
    legend.position = "inside",
    legend.position.inside = c(0.87, 0.17),
    legend.text = element_text(size = rel(0.55)),
    legend.title = element_text(hjust = 0.5, face = "bold", size = rel(0.7), margin = margin(t = 3)),
    plot.subtitle = element_text(hjust = 0.18),
    legend.title.position = "bottom"
  )

# Add the histogram to the map
combined_map_hist_points_diffs <- unemp_map_points_diffs + 
  inset_element(hist_legend_diffs, left = 0.75, bottom = 0.26, right = 0.98, top = 0.45)
combined_map_hist_points_diffs

Citation

BibTeX citation:

@online{heiss2025,
  author = {Heiss, Andrew},
  title = {How to Use a Histogram as a Legend in \{Ggplot2\}},
  date = {2025-02-19},
  url = {https://www.andrewheiss.com/blog/2025/02/19/ggplot-histogram-legend/},
  doi = {10.59350/gt0nr-wct91},
  langid = {en}
}

For attribution, please cite this work as:

Heiss, Andrew. 2025. “How to Use a Histogram as a Legend in {Ggplot2}.” February 19, 2025. https://doi.org/10.59350/gt0nr-wct91.

How to move Crimea from Russia to Ukraine in maps with R

Andrew Heiss — Thu, 13 Feb 2025 05:00:00 GMT

The Natural Earth Project

The Natural Earth Project provides high quality public domain geographic data with all sorts of incredible detail, at three resolutions: high (1:10m), medium (1:50m), and low (1:110m). I use their data all the time in my own work and research, and the {rnaturalearth} package makes it really easy to get their data into R for immediate mapping. I mean, look at this!

library(tidyverse)
library(sf)
library(rnaturalearth)

# Set some colors
ukr_blue <- "#0057b7"  # Blue from the Ukrainian flag
ukr_yellow <- "#ffdd00"  # Yellow from the Ukrainian flag
rus_red <- "#d62718"  # Red from the Russian flag

clr_ocean <- "#d9f0ff"
clr_land <- "#facba6"

# CARTOColors Prism (https://carto.com/carto-colors/)
carto_prism = c(
  "#5F4690", "#1D6996", "#38A6A5", "#0F8554", "#73AF48", "#EDAD08", 
  "#E17C05", "#CC503E", "#94346E", "#6F4070", "#994E95", "#666666"
)

ne_countries(scale = 110) |> 
  filter(admin != "Antarctica") |> 
  ggplot() + 
  geom_sf(aes(fill = continent), color = "white", linewidth = 0.1) +
  scale_fill_manual(values = carto_prism, guide = "none") +
  coord_sf(crs = "+proj=robin") +
  theme_void()

Natural Earth’s de facto policy

Maps are intensely political things. There are dozens of disputes over maps and land and territories (e.g., Palestine, Western Sahara, Northern Cyprus, Taiwan, Kashmir, etc.), and many UN member states don’t recognize other UN member states (Israel isn’t recognized by many Arab states; Pakistan doesn’t recognize Armenia).

The Natural Earth Project’s official policy for disputed territories is to reflect on-the-ground de facto control over each piece of land¹:

¹ OpenStreetMap does this too.

Natural Earth Vector draws boundaries of sovereign states according to defacto status. We show who actually controls the situation on the ground. For instance, we show China and Taiwan as two separate states. But we show Palestine as part of Israel.

Though they claim that this de facto policy “is rigorous and self consistent”, it gets them in trouble a lot. For instance, there are nearly two dozen issues on GitHub about Crimea, which is illegally occupied by Russia but de jure part of Ukraine. There are huge debates over the ethics of the de facto policy.

Treating the Natural Earth de facto policy as a de facto policy

I’m not weighing in on that policy here! I don’t super like it—it makes it really hard to map Palestine, for instance—but it is what it is. In this post I’m treating the de facto policy as the de facto situation of the data.

Natural Earth de jure points of view

Natural Earth’s solution for disputed territories is to offer different options to reflect country-specific de jure points of view. They offer pre-built high resolution shapefiles for 31 different points of view, so it’s possible to download data that reflect de jure boundaries for a bunch of different countries. Their other shapefiles all have columns like fclass_us, fclass_ua, and so on for doing… something?… with the point of view. I can’t figure out how these columns work beyond localization stuff (i.e. changing place names based on the point of view). The documentation doesn’t say much about how to actually use these different points of view, and pre-built medium and low resolution maps don’t exist yet.

For example, the US doesn’t de jure-ily recognize the Russian occupation of Crimea, so if we download the pre-built high resolution (10m) version of the world from the US point of view, we can see Crimea as part of Ukraine (we have to download this manually—rnaturalearth::ne_countries() doesn’t support POV files):

world_10_us <- read_sf("ne_10m_admin_0_countries_usa/ne_10m_admin_0_countries_usa.shp")

world_10_us |> 
  filter(ADMIN == "Ukraine") |> 
  ggplot() +
  geom_sf(fill = ukr_blue) +
  theme_void()

Unfortunately, the pre-built point-of-view datasets only exist for the 10m high resolution data. If we want to show medium or low resolution maps, we’re stuck with the de facto version of the map, which means Crimea will be shown as part of Russia. Here’s the low resolution version of Ukraine, with Crimea in Russia:

world <- ne_countries(scale = 110, type = "map_units")

ukraine <- world |> filter(admin == "Ukraine")
russia <- world |> filter(admin == "Russia")

ukraine_bbox <- ukraine |> 
  st_buffer(dist = 100000) |>  # Add 100,000 meter buffer around the country 
  st_bbox()

ggplot() +
  geom_sf(data = world, fill = clr_land) +
  geom_sf(data = russia, fill = rus_red) + 
  geom_sf(data = ukraine, fill = ukr_blue, color = ukr_yellow, linewidth = 2) + 
  coord_sf(
    xlim = c(ukraine_bbox["xmin"], ukraine_bbox["xmax"]), 
    ylim = c(ukraine_bbox["ymin"], ukraine_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

Relocating Crimea manually with R and {sf}

Natural Earth’s recommendation is to “mashup our countries and disputed areas themes to match their particular political outlook”, so we’ll do that here. Though we won’t use any of the point-of-view themes or features because I have no idea how to get those to work.

Instead we’ll manipulate the geometry data directly and move Crimea from the Russia shape to the Ukraine shape by extracting the Crimea POLYGON from Russia and merging it with Ukraine.

The actual geometric shapes for all the countries in world are MULTIPOLYGONs, or collections of POLYGON geometric objects. For instance, Russia is defined as a single MULTIPOLYGON:

russia |> st_geometry()
## Geometry set for 1 feature 
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: 41.15 xmax: 180 ymax: 81.25
## Geodetic CRS:  WGS 84
## MULTIPOLYGON (((178.7 71.1, 180 71.52, 180 70.8...

We can split MULTIPOLYGONs into their component POLYGONs with st_cast(). Russia consists of 14 different shapes:

russia_polygons <- russia |> 
  st_geometry() |> 
  st_cast("POLYGON")

russia_polygons
## Geometry set for 14 features 
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: 41.15 xmax: 180 ymax: 81.25
## Geodetic CRS:  WGS 84
## First 5 geometries:
## POLYGON ((178.7 71.1, 180 71.52, 180 70.83, 178...
## POLYGON ((49.1 46.4, 48.65 45.81, 47.68 45.64, ...
## POLYGON ((93.78 81.02, 95.94 81.25, 97.88 80.75...
## POLYGON ((102.8 79.28, 105.4 78.71, 105.1 78.31...
## POLYGON ((138.8 76.14, 141.5 76.09, 145.1 75.56...

The second one is the main Russia landmass:

plot(russia_polygons[2])

The last one is the Crimean peninsula:

plot(russia_polygons[14])
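The same splitting idea shows up in GeoJSON, which is the format the Observable section of this post works with. As a rough sketch in plain JavaScript (not {sf}, and not a claim about how st_cast() is implemented), a MultiPolygon is just an array of polygons, so pulling it apart mostly means re-wrapping each entry. The splitMultiPolygon() helper and the toy squares here are hypothetical and only for illustration:

```javascript
// Split a GeoJSON MultiPolygon geometry into an array of Polygon geometries.
// A MultiPolygon's `coordinates` is an array of polygons; each polygon is an
// array of rings (exterior ring first, then any holes).
function splitMultiPolygon(geometry) {
  if (geometry.type === "Polygon") return [geometry];
  if (geometry.type !== "MultiPolygon") {
    throw new Error(`Expected Polygon or MultiPolygon, got ${geometry.type}`);
  }
  return geometry.coordinates.map((rings) => ({
    type: "Polygon",
    coordinates: rings
  }));
}

// Hypothetical toy shape: two small squares, NOT real country borders
const toyMultiPolygon = {
  type: "MultiPolygon",
  coordinates: [
    [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]],
    [[[5, 5], [5, 6], [6, 6], [6, 5], [5, 5]]]
  ]
};

const toyPolygons = splitMultiPolygon(toyMultiPolygon);
console.log(toyPolygons.length);
```

Each element of the result is a standalone Polygon, much like each entry of russia_polygons is a standalone POLYGON.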

Identifying the Crimea POLYGON from a POINT

The only way I could figure out what each of these POLYGONs represented was to plot them individually until I saw a recognizable shape. And if I use a different map (like the 50m or 10m resolution maps), there’s no guarantee that Russia will have 14 POLYGONs or that the 14th one will be Crimea. We need a more reliable way to find the Crimea shape.

One way to do this is to create a POINT object based somewhere in Crimea and do some geometric set math to identify which Russian POLYGON contains it. The point 45°N 34°E happens to be in the middle of Crimea:

crimea_point <- st_sfc(st_point(c(34, 45)), crs = st_crs(world))

ggplot() +
  geom_sf(data = world, fill = clr_land) +
  geom_sf(data = russia, fill = rus_red) + 
  geom_sf(data = ukraine, fill = ukr_blue, color = ukr_yellow, linewidth = 2) + 
  geom_sf(data = crimea_point) +
  coord_sf(
    xlim = c(ukraine_bbox["xmin"], ukraine_bbox["xmax"]), 
    ylim = c(ukraine_bbox["ymin"], ukraine_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

We can use it with st_intersects() to identify the Russia POLYGON that contains it:

# Extract the Russia MULTIPOLYGON and convert it to polygons
russia_polygons <- world |> 
  filter(admin == "Russia") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Extract the Russia polygon that has Crimea in it
crimea_polygon <- russia_polygons |>
  keep(\(x) st_intersects(x, crimea_point, sparse = FALSE))

# This is the same as russia_polygons[14]
plot(crimea_polygon)
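To get a feel for what this kind of containment check involves, here’s a minimal ray-casting sketch in plain JavaScript. It’s a toy under several assumptions, not what st_intersects() actually runs: it uses flat planar math rather than the spherical geometry {sf} uses by default, the two rectangles are hypothetical stand-ins rather than real Russian POLYGONs, and pointInRing() and pointInPolygon() are made-up helper names:

```javascript
// Ray casting: count how many edges of a ring a horizontal ray from the point
// crosses. An odd count means the point is inside. Planar math on [x, y] pairs.
function pointInRing([px, py], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses =
      (yi > py) !== (yj > py) &&
      px < ((xj - xi) * (py - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// A point is in a polygon if it is inside the exterior ring and outside every hole
function pointInPolygon(point, polygon) {
  const [exterior, ...holes] = polygon.coordinates;
  return (
    pointInRing(point, exterior) &&
    !holes.some((hole) => pointInRing(point, hole))
  );
}

// Hypothetical rectangles standing in for real polygons (NOT actual borders)
const toyPolygons = [
  { type: "Polygon", coordinates: [[[40, 50], [40, 60], [60, 60], [60, 50], [40, 50]]] },
  { type: "Polygon", coordinates: [[[32, 44], [32, 46], [36, 46], [36, 44], [32, 44]]] }
];

const crimeaPoint = [34, 45]; // [longitude, latitude], the same point used above
const matchIndex = toyPolygons.findIndex((p) => pointInPolygon(crimeaPoint, p));
console.log(matchIndex);
```

The index that comes back plays the same role as the 14 in russia_polygons[14], except it’s found from the point instead of by eyeballing plots.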

Extracting the Crimea POLYGON from Russia

We can then remove that polygon from Russia and recombine everything back into a MULTIPOLYGON. It works!

# Remove Crimea from Russia
new_russia <- russia_polygons |>
  discard(\(x) any(st_equals(x, crimea_polygon, sparse = FALSE))) |> 
  st_combine() |> 
  st_cast("MULTIPOLYGON")

ggplot() +
  geom_sf(data = world, fill = clr_land) +
  geom_sf(data = new_russia, fill = rus_red) + 
  geom_sf(data = ukraine, fill = ukr_blue, color = ukr_yellow, linewidth = 2) + 
  coord_sf(
    xlim = c(ukraine_bbox["xmin"], ukraine_bbox["xmax"]), 
    ylim = c(ukraine_bbox["ymin"], ukraine_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

Adding the Crimea POLYGON to Ukraine

Next we need to merge crimea_polygon with Ukraine. We’ll convert Ukraine to its component POLYGONs, combine those with Crimea, and recombine everything back to a MULTIPOLYGON. It also works!

# Extract the Ukraine MULTIPOLYGON and convert it to polygons
ukraine_polygons <- world |> 
  filter(admin == "Ukraine") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Add Crimea to Ukraine
new_ukraine <- st_union(c(ukraine_polygons, crimea_polygon)) |>
  st_cast("MULTIPOLYGON")

ggplot() +
  geom_sf(data = world, fill = clr_land) +
  geom_sf(data = new_russia, fill = rus_red) + 
  geom_sf(data = new_ukraine, fill = ukr_blue, color = ukr_yellow, linewidth = 2) + 
  coord_sf(
    xlim = c(ukraine_bbox["xmin"], ukraine_bbox["xmax"]), 
    ylim = c(ukraine_bbox["ymin"], ukraine_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

Updating Russia and Ukraine in the full data

The last step is to modify the full world dataset and replace the existing geometry values for the two countries with the updated boundaries:

world_un <- world |>
  mutate(geometry = case_when(
    admin == "Ukraine" ~ new_ukraine,
    admin == "Russia" ~ new_russia,
    .default = geometry
  ))

Now that world_un has the corrected boundaries in it, it works like normal. Here’s a map of Eastern Europe, colored by mapcolor9 (a column that comes with Natural Earth data that lets you use 9 distinct colors to fill all countries without having bordering countries share colors). Crimea is in Ukraine now:

eastern_eu_bbox <- ukraine |> 
  st_buffer(dist = 700000) |>  # Add 700,000 meter buffer around the country 
  st_bbox()

ggplot() +
  geom_sf(data = world_un, aes(fill = factor(mapcolor9)), linewidth = 0.25, color = "white") +
  scale_fill_manual(values = carto_prism, guide = "none") +
  coord_sf(
    xlim = c(eastern_eu_bbox["xmin"], eastern_eu_bbox["xmax"]), 
    ylim = c(eastern_eu_bbox["ymin"], eastern_eu_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

The whole game

Everything above was fairly didactic, with illustrations at each intermediate step. Here’s the whole process all in one place:

world_110 <- ne_countries(scale = 110, type = "map_units")

crimea_point_110 <- st_sfc(st_point(c(34, 45)), crs = st_crs(world_110))

# Extract the Russia MULTIPOLYGON and convert it to polygons
russia_polygons_110 <- world_110 |> 
  filter(admin == "Russia") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Extract the Russia polygon that has Crimea in it
crimea_polygon_110 <- russia_polygons_110 |>
  keep(\(x) st_intersects(x, crimea_point_110, sparse = FALSE))

# Remove Crimea from Russia
new_russia_110 <- russia_polygons_110 |>
  discard(\(x) any(st_equals(x, crimea_polygon_110, sparse = FALSE))) |> 
  st_combine() |> 
  st_cast("MULTIPOLYGON")

# Extract the Ukraine MULTIPOLYGON and convert it to polygons
ukraine_polygons_110 <- world_110 |> 
  filter(admin == "Ukraine") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Add Crimea to Ukraine
new_ukraine_110 <- st_union(c(ukraine_polygons_110, crimea_polygon_110)) |>
  st_cast("MULTIPOLYGON")

world_un_110 <- world_110 |>
  mutate(geometry = case_when(
    admin == "Ukraine" ~ new_ukraine_110,
    admin == "Russia" ~ new_russia_110,
    .default = geometry
  ))

Moving Crimea with medium resolution (50m) data

This same approach works for other map resolutions too, like 50m:

world_50 <- ne_countries(scale = 50, type = "map_units")

crimea_point_50 <- st_sfc(st_point(c(34, 45)), crs = st_crs(world_50))

# Extract the Russia MULTIPOLYGON and convert it to polygons
russia_polygons_50 <- world_50 |> 
  filter(admin == "Russia") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Extract the Russia polygon that has Crimea in it
crimea_polygon_50 <- russia_polygons_50 |>
  keep(\(x) st_intersects(x, crimea_point_50, sparse = FALSE))

# Remove Crimea from Russia
new_russia_50 <- russia_polygons_50 |>
  discard(\(x) any(st_equals(x, crimea_polygon_50, sparse = FALSE))) |> 
  st_combine() |> 
  st_cast("MULTIPOLYGON")

# Extract the Ukraine MULTIPOLYGON and convert it to polygons
ukraine_polygons_50 <- world_50 |> 
  filter(admin == "Ukraine") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Add Crimea to Ukraine
new_ukraine_50 <- st_union(c(ukraine_polygons_50, crimea_polygon_50)) |>
  st_cast("MULTIPOLYGON")

world_un_50 <- world_50 |>
  mutate(geometry = case_when(
    admin == "Ukraine" ~ new_ukraine_50,
    admin == "Russia" ~ new_russia_50,
    .default = geometry
  ))

Here’s a higher quality map of Eastern Europe with Crimea in Ukraine:

ggplot() +
  geom_sf(data = world_un_50, aes(fill = factor(mapcolor9)), linewidth = 0.25, color = "white") +
  scale_fill_manual(values = carto_prism, guide = "none") +
  coord_sf(
    xlim = c(eastern_eu_bbox["xmin"], eastern_eu_bbox["xmax"]), 
    ylim = c(eastern_eu_bbox["ymin"], eastern_eu_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

Using the adjusted Natural Earth data as GeoJSON in Observable JS

This updated shapefile works with Observable Plot too (see here for more about how to make nice maps with Observable), but requires one strange tweak because of weird behavior with the GeoJSON file format.

Broken GeoJSON

Let’s export the adjusted geographic data to GeoJSON:

# Save as geojson for Observable Plot
st_write(
  obj = world_un, 
  dsn = "ne_110m_admin_0_countries_un_BROKEN.geojson", 
  driver = "GeoJSON",
  quiet = TRUE,
  delete_dsn = TRUE  # Overwrite the existing .geojson if there is one
)

And then load it with Observable JS:

world_broken = FileAttachment("ne_110m_admin_0_countries_un_BROKEN.geojson").json()

clr_ocean = "#d9f0ff"
clr_land = "#facba6"
ukr_blue = "#0057b7"
ukr_yellow = "#ffdd00"
rus_red = "#d62718"

And then plot it:

Plot.plot({
  projection: "equal-earth",
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_broken, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }) 
  ]
})

lol what even. The new Ukraine shape seems to have broken boundaries that distort everything else in the map. Weirdly, Ukraine is filled with the ocean color while the rest of the globe—both the ocean and whatever countries didn’t have their borders erased—is the color of land.

Let’s zoom in on just Ukraine:

ukraine = world_broken.features.find(d => d.properties.name === "Ukraine")

Plot.plot({
  projection: { 
    type: "equal-earth", 
    domain: ukraine, 
    inset: 50 
  }, 
  width: 800, 
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_broken, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }),
    Plot.geo(ukraine, { fill: ukr_blue })
  ]
})

¯\_(ツ)_/¯. Now Ukraine is the color of the ocean and the whole rest of the world is the dark blue of the Ukrainian flag. And it didn’t zoom in at all.

GeoJSON and ↻ ↺ winding order ↻ ↺

This is a symptom of an issue with GeoJSON winding order. GeoJSON cares about the direction that country borders (and all LINESTRING elements) are drawn in. Exterior borders should be drawn counterclockwise; interior borders should be drawn clockwise. If a geographic shape doesn’t follow this winding order, bad things happen. Specifically:

a shape that represents a tiny speck of land becomes inflated to represent the whole globe minus that tiny speck of land, the map fills with a uniform color, the local projection explodes. (via @fil)

That’s exactly what’s happening here. Somehow the winding order is getting reversed when we combine Ukraine with Crimea. {sf} itself doesn’t care about winding order, so everything works fine within R; GeoJSON is picky about winding order, so things break.
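If you want to see which way a ring in an exported file actually runs, one rough way to check is the shoelace formula. This is a minimal sketch in plain JavaScript under a few assumptions: it treats coordinates as flat x/y values, so it’s only reasonable for small rings that don’t cross the antimeridian, and signedArea2() and ringDirection() are hypothetical helper names. It only reports a direction. It says nothing about which direction a particular renderer expects, so treat it as a way to inspect data rather than a fix:

```javascript
// Shoelace formula: twice the signed area of a closed ring of [x, y] positions.
// With x pointing right and y pointing up, a positive value means the ring
// runs counterclockwise and a negative value means it runs clockwise.
function signedArea2(ring) {
  let sum = 0;
  for (let i = 0; i < ring.length - 1; i++) {
    const [x1, y1] = ring[i];
    const [x2, y2] = ring[i + 1];
    sum += x1 * y2 - x2 * y1;
  }
  return sum;
}

function ringDirection(ring) {
  const area = signedArea2(ring);
  if (area === 0) return "degenerate";
  return area > 0 ? "counterclockwise" : "clockwise";
}

// Hypothetical closed square (first and last positions match)
const square = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]];
const reversedSquare = [...square].reverse();

console.log(ringDirection(square));         // counterclockwise
console.log(ringDirection(reversedSquare)); // clockwise
```

For a real file, the ring to pass in would be something like feature.geometry.coordinates[0] for a Polygon, or feature.geometry.coordinates[0][0] for the first polygon of a MultiPolygon.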

Fixing it is tricky though! {sf} uses a bunch of different libraries behind the scenes to do its geographic calculations, including GEOS and S2, and they all have different approaches to polygon creation. Apparently GEOS goes clockwise by default while others go counterclockwise, or something. It should theoretically be possible to fix by adding st_sfc(check_ring_dir = TRUE) after making the new Ukraine shape:

# It would be cool if this worked but it doesn't :(
new_ukraine <- st_union(c(ukraine_polygons, crimea_polygon)) |>
  st_sfc(check_ring_dir = TRUE) |> 
  st_cast("MULTIPOLYGON")

But that doesn’t change anything (nor does it work for this person at GitHub).

Clean GeoJSON with correct winding order

BUT there’s another solution. We can force {sf} to not use the S2 library (which it uses by default, I guess?), since S2 seems to go in the wrong direction. If we turn off S2 with sf_use_s2(FALSE), make the new Ukraine shape, and then turn S2 back on with sf_use_s2(TRUE), things work!

# Add Crimea to Ukraine
sf_use_s2(FALSE)
new_ukraine_110 <- st_union(c(ukraine_polygons_110, crimea_polygon_110)) |>
  st_cast("MULTIPOLYGON")
sf_use_s2(TRUE)

Here’s the full process with the 110m map:

world_110 <- ne_countries(scale = 110, type = "map_units")

crimea_point_110 <- st_sfc(st_point(c(34, 45)), crs = st_crs(world_110))

# Extract the Russia MULTIPOLYGON and convert it to polygons
russia_polygons_110 <- world_110 |> 
  filter(admin == "Russia") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Extract the Russia polygon that has Crimea in it
crimea_polygon_110 <- russia_polygons_110 |>
  keep(\(x) st_intersects(x, crimea_point_110, sparse = FALSE))

# Extract the Ukraine MULTIPOLYGON and convert it to polygons
ukraine_polygons_110 <- world_110 |> 
  filter(admin == "Ukraine") |> 
  st_geometry() |> 
  st_cast("POLYGON")

# Add Crimea to Ukraine
sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
new_ukraine_110 <- st_union(c(ukraine_polygons_110, crimea_polygon_110)) |>
  st_cast("MULTIPOLYGON")
## although coordinates are longitude/latitude, st_union assumes that they are planar
sf_use_s2(TRUE)
## Spherical geometry (s2) switched on

# Remove Crimea from Russia
new_russia_110 <- russia_polygons_110 |>
  discard(\(x) any(st_equals(x, crimea_polygon_110, sparse = FALSE))) |> 
  st_combine() |> 
  st_cast("MULTIPOLYGON")

# Add the modified Russia and Ukraine to the main data
world_un_110_fixed <- world_110 |>
  mutate(geometry = case_when(
    admin == "Ukraine" ~ new_ukraine_110,
    admin == "Russia" ~ new_russia_110,
    .default = geometry
  ))

# Save as GeoJSON
st_write(
  obj = world_un_110_fixed, 
  dsn = "ne_110m_admin_0_countries_un.geojson", 
  driver = "GeoJSON",
  quiet = TRUE,
  delete_dsn = TRUE
)

Here’s the new world map with the correct Ukraine:

world_fixed = FileAttachment("ne_110m_admin_0_countries_un.geojson").json()

Plot.plot({
  projection: "equal-earth",
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_fixed, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }) 
  ]
})

We can zoom in on Ukraine too:

ukraine_good = world_fixed.features.find(d => d.properties.name === "Ukraine")
russia = world_fixed.features.find(d => d.properties.name === "Russia")

Plot.plot({
  projection: { 
    type: "equal-earth", 
    domain: ukraine_good, 
    inset: 50 
  }, 
  width: 800, 
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_fixed, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }),
    Plot.geo(russia, { fill: rus_red }),
    Plot.geo(ukraine_good, { 
      fill: ukr_blue, 
      stroke: ukr_yellow, 
      strokeWidth: 3
    })
  ]
})

Alternative data sources

Natural Earth isn’t the only source of geographic data online, and other sources use de jure borders instead of de facto borders, like these:

GISCO

The European Commission’s Eurostat hosts the Geographic Information System of the Commission (GISCO), which provides GIS data for the EU. They offer global shapefiles that follow EU-based de jure borders. The {giscoR} package provides a nice frontend for getting that data into R at 5 different resolutions (1:60m, 1:20m, 1:10m, 1:3m, and super detailed 1:1m!). It does not come with additional metadata for each country, though (i.e. there are no regional divisions, population values, map colors, names in other languages, and so on), so it requires some extra cleaning work if you want those details. For example, we can add region information with {countrycode}:

library(giscoR)
library(countrycode)

world_gisco <- gisco_get_countries(
  year = "2024",
  epsg = "4326",
  resolution = "60"
) |> 
  # Add World Bank regions
  mutate(region = countrycode(ISO3_CODE, origin = "iso3c", destination = "region"))

world_gisco |> 
  filter(NAME_ENGL != "Antarctica") |> 
  ggplot() + 
  geom_sf(aes(fill = region), color = "white", linewidth = 0.1) +
  scale_fill_manual(values = carto_prism, guide = "none") +
  coord_sf(crs = "+proj=robin") +
  theme_void()

Since the EU doesn’t de jure-ily recognize the Russian occupation of Crimea, Crimea is in Ukraine:

ukraine_gisco <- world_gisco |> filter(NAME_ENGL == "Ukraine")
russia_gisco <- world_gisco |> filter(NAME_ENGL == "Russian Federation")

ukraine_gisco_bbox <- ukraine_gisco |> 
  st_buffer(dist = 100000) |>  # Add 100,000 meter buffer around the country 
  st_bbox()

ggplot() +
  geom_sf(data = world_gisco, fill = clr_land) +
  geom_sf(data = russia_gisco, fill = rus_red) + 
  geom_sf(data = ukraine_gisco, fill = ukr_blue, color = ukr_yellow, linewidth = 2) + 
  coord_sf(
    xlim = c(ukraine_gisco_bbox["xmin"], ukraine_gisco_bbox["xmax"]), 
    ylim = c(ukraine_gisco_bbox["ymin"], ukraine_gisco_bbox["ymax"])
  ) +
  theme_void() +
  theme(panel.background = element_rect(fill = clr_ocean))

It’s possible to use GISCO data with Observable too. We could load the data into R and clean it up there (like adding regions and other details) and then save it as GeoJSON, like we did with the Natural Earth data.

Or we can grab the original raw GeoJSON from Eurostat directly. For example, here’s the raw 60M 2024 world map using the WGS84 (4326) projection that we grabbed with gisco_get_countries() earlier.

world_gisco = await FileAttachment("https://gisco-services.ec.europa.eu/distribution/v2/countries/geojson/CNTR_RG_60M_2024_4326.geojson").json()

ukraine_gisco = world_gisco.features.find(d => d.properties.NAME_ENGL === "Ukraine")
russia_gisco = world_gisco.features.find(d => d.properties.NAME_ENGL === "Russian Federation")

Plot.plot({
  projection: { 
    type: "equal-earth", 
    domain: ukraine_gisco, 
    inset: 50 
  }, 
  width: 800, 
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_gisco, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }),
    Plot.geo(russia_gisco, { fill: rus_red }),
    Plot.geo(ukraine_gisco, { 
      fill: ukr_blue, 
      stroke: ukr_yellow, 
      strokeWidth: 3
    })
  ]
})

Visionscarto

There are JSON-based map files created by @fil at Observable as part of the Visionscarto project.

They’re based on Natural Earth, but with some specific adjustments like adding Crimea to Ukraine, making Gaza a little bit bigger so that it doesn’t get dropped at lower resolutions like 110m, and using UN boundaries for Western Sahara.

Like GISCO, though, these don’t have the additional columns that Natural Earth comes with (country names in a bunch of languages, region and continent designations, map coloring schemes, population and GDP estimates, etc.), and those would need to be added manually in R or Observable or whatever.

import {world110m} from "@visionscarto/geo"

countries110m = topojson.feature(world110m, world110m.objects.countries)
ukraine_visionscarto = countries110m.features.find(d => d.properties.name === "Ukraine")
russia_visionscarto = countries110m.features.find(d => d.properties.name === "Russia")

Plot.plot({
  projection: { 
    type: "equal-earth", 
    domain: ukraine_visionscarto, 
    inset: 50 
  }, 
  width: 800, 
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(countries110m, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    }),
    Plot.geo(russia_visionscarto, { fill: rus_red }),
    Plot.geo(ukraine_visionscarto, { 
      fill: ukr_blue, 
      stroke: ukr_yellow, 
      strokeWidth: 3
    })
  ]
})
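Adding those missing columns on the JavaScript side could look something like this. It’s only a sketch under assumptions: addProperties() is a hypothetical helper, the features have no real geometry, and the region values are placeholders rather than real data. The idea is to build a lookup from a table of extra rows and merge each matching row into a feature’s properties:

```javascript
// Attach extra columns to each feature by matching on a shared key.
// Returns a new FeatureCollection; the input is left untouched.
function addProperties(collection, rows, featureKey, rowKey) {
  const lookup = new Map(rows.map((row) => [row[rowKey], row]));
  return {
    ...collection,
    features: collection.features.map((feature) => ({
      ...feature,
      properties: {
        ...feature.properties,
        // Features without a matching row keep their original properties
        ...(lookup.get(feature.properties[featureKey]) ?? {})
      }
    }))
  };
}

// Hypothetical stand-in for a FeatureCollection (geometry omitted on purpose)
const toyCountries = {
  type: "FeatureCollection",
  features: [
    { type: "Feature", properties: { name: "Ukraine" }, geometry: null },
    { type: "Feature", properties: { name: "Russia" }, geometry: null }
  ]
};

// Placeholder metadata, NOT real values
const extraRows = [{ name: "Ukraine", region: "placeholder region A" }];

const withRegions = addProperties(toyCountries, extraRows, "name", "name");
console.log(withRegions.features.map((d) => d.properties));
```

Matching on country names is fragile since different sources spell names differently, so a shared ISO code is probably a safer key when both datasets have one.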

Less automatic sources

There are other sources too, but they require manual downloading:

Citation

BibTeX citation:

@online{heiss2025,
  author = {Heiss, Andrew},
  title = {How to Move {Crimea} from {Russia} to {Ukraine} in Maps with
    {R}},
  date = {2025-02-13},
  url = {https://www.andrewheiss.com/blog/2025/02/13/natural-earth-crimea/},
  doi = {10.59350/28kp0-nbq92},
  langid = {en}
}

For attribution, please cite this work as:

Heiss, Andrew. 2025. “How to Move Crimea from Russia to Ukraine in Maps with R.” February 13, 2025. https://doi.org/10.59350/28kp0-nbq92.

Using USAID data to make fancy world maps with Observable Plot

Andrew Heiss — Mon, 10 Feb 2025 05:00:00 GMT

As part of Elon Musk’s weird Department of Government Efficiency’s unconstitutional rampage through the federal government, USAID’s ForeignAssistance.gov was taken offline on January 31, 2025. It reappeared on February 3, but it’s not clear how long it will be available, especially as USAID is gutted (despite court orders and injunctions to stop).

I study civil society, human rights, and foreign aid and rely on USAID aid data for several of my research projects, so as a backup, I used Datasette to create a mirror website/API of the entire ForeignAssistance.gov dataset at https://foreignassistance-data.andrewheiss.com/. Everything as of December 19, 2024 is available there, both as a queryable SQL database and as downloadable CSV files.

I also made a little frontend website with links to each individual dataset. As I built that website, I decided to try recreating the ForeignAssistance.gov dashboard, which had neat interactive maps and tables.

ForeignAssistance.gov dashboard

Since Quarto has native support for Observable JS for interactive work, and since I’ve meant to really dig into Observable and figure out how to make more interactive graphs, I figured I’d play around with the rescued USAID data.

So in this post, I show what I learned about working with geographic data and making pretty maps with Observable Plot.

Caveat!

I’m really bad at JavaScript! The code here is probably wildly inefficient and feels R-flavored.

But it works, and that’s all that matters :)

Working with map data

Get map data

Observable Plot uses the d3-geo module behind the scenes to parse and work with map data, and D3 typically works with data formatted as GeoJSON. There are tons of high quality geographic data sources online, like the ~~US Census~~ (they’ve been removing those in the past few weeks), IPUMS NHGIS, IPUMS IHGIS, and the Natural Earth project, and cities and states typically offer GIS data for public sector-related data. These data sources tend to be stored as shapefiles, which are a fairly complex (but standard) format for geographic data that involve multiple files.

Observable Plot/D3 might be able to work with shapefiles directly, but it’s nowhere in the documentation. They seem to expect GeoJSON instead. We could hunt around online for GeoJSON data, but—even better—we can use the {sf} package in R to convert any shapefile-based data into GeoJSON by setting driver = "GeoJSON" in sf::st_write(). Here we’ll load two datasets from Natural Earth—(1) small scale low resolution 1:110m data for mapping the whole world and (2) medium scale 1:50m data for mapping specific regions and countries—and convert them to GeoJSON files.

library(sf)
library(rnaturalearth)

# Get low resolution Natural Earth data as map units instead of countries because of France
world <- ne_countries(scale = 110, type = "map_units")

# Save as geojson for Observable Plot
st_write(
  obj = world, 
  dsn = "ne_110m_admin_0_countries.geojson", 
  driver = "GeoJSON"
)

# Save medium resolution geojson
st_write(
  obj = ne_countries(scale = 50, type = "countries"), 
  dsn = "ne_50m_admin_0_countries.geojson", 
  driver = "GeoJSON"
)

Maybe skip intermediate saving?

We could probably use Quarto’s special R-to-OJS function ojs_define() and make these R objects directly accessible to OJS without needing to save intermediate files:

ojs_define(world = ne_countries(scale = 110, type = "map_units"))

…but geographic data is complex and I don’t know how things like Observable Plot’s Plot.geo() handle data that’s not read as GeoJSON. So to keep things simple, I ended up just saving these as GeoJSON. 🤷‍♂️

Maps and projections with Observable Plot

We can load these into our document with OJS with FileAttachment():

world = FileAttachment("ne_110m_admin_0_countries.geojson").json()
world_medium = FileAttachment("ne_50m_admin_0_countries.geojson").json()

Check out the structure of world. It’s a FeatureCollection with a slot named crs with the projection information and a slot named features with entries for each country. Each country Feature has a slot named properties with columns like name, iso_a3, formal_en, pop_est, and other details.

world
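Since the real object is huge, here’s a stripped-down sketch of that same structure in plain JavaScript. Everything in it is hypothetical: the two “countries” are tiny squares, the property values are made up, and the crs slot is left out. It just shows how the features and properties slots nest and how to reach into them:

```javascript
// A tiny hand-written FeatureCollection with the same nesting as `world`
const toyWorld = {
  type: "FeatureCollection",
  features: [
    {
      type: "Feature",
      properties: { name: "Country A", iso_a3: "AAA", pop_est: 100 },
      geometry: {
        type: "Polygon",
        coordinates: [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]]
      }
    },
    {
      type: "Feature",
      properties: { name: "Country B", iso_a3: "BBB", pop_est: 250 },
      geometry: {
        type: "Polygon",
        coordinates: [[[2, 0], [2, 1], [3, 1], [3, 0], [2, 0]]]
      }
    }
  ]
};

// Pull one column out of every feature
const names = toyWorld.features.map((d) => d.properties.name);

// Look up a single feature by one of its properties
const countryB = toyWorld.features.find((d) => d.properties.iso_a3 === "BBB");

// Summarize a numeric column across features
const totalPop = toyWorld.features.reduce((sum, d) => sum + d.properties.pop_est, 0);

console.log(names, countryB.geometry.type, totalPop);
```

The properties slot behaves like one row of a data frame, so most R-style column operations turn into .map(), .find(), .filter(), or .reduce() over features.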

To plot it, we can use the Geo mark:

Plot.plot({
  marks: [
    Plot.geo(world)
  ]
})

To make things look nicer throughout this post, we’ll define some nicer colors for countries and land and ocean from CARTOColors:

carto_prism = [
  "#5F4690", "#1D6996", "#38A6A5", "#0F8554", "#73AF48", "#EDAD08", 
  "#E17C05", "#CC503E", "#94346E", "#6F4070", "#994E95", "#666666"
]

// From R:
// clr_ocean <- colorspace::lighten("#88CCEE", 0.7)
clr_ocean = "#D9F0FF"

// From CARTOColors Peach 2
clr_land = "#facba6"

We’ll make the land be orange-ish, add some thin black borders around the countries, and include a blue background color with Plot.frame():

Plot.plot({
  marks: [
    Plot.frame({ fill: clr_ocean }),  
    Plot.geo(world, { 
      stroke: "black", 
      strokeWidth: 0.5, 
      fill: clr_land 
    }) 
  ]
})

Built-in projections

Taking a round globe and smashing it on a two-dimensional surface always requires geometric shenanigans to get things flat. We can control how things get flattened by specifying the projection for the map. Here we’ll use the Equal Earth projection (invented in 2018 to show countries and continents at their true relative sizes to each other). Since projections contain relative height and width details, we need to specify a width for the plot now. I arbitrarily chose 1000 pixels here, which is the maximum width—it should autoshrink in smaller browser windows, and the height should be calculated automatically. Finally, instead of adding the background color with Plot.frame(), we can use Plot.sphere() to get a nicer background that uses the specified projection:

Plot.plot({
  projection: "equal-earth", 
  width: 1000, 
  marks: [
    Plot.sphere({ fill: clr_ocean }), 
    Plot.geo(world, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})
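For a sense of what a projection actually does to coordinates, here’s a rough plain-JavaScript sketch of the Equal Earth forward equations on a unit sphere. The polynomial coefficients are the ones I understand to be in the published Equal Earth definition, so treat them as an assumption and check them against the original paper or d3-geo before relying on them. equalEarth() is a hypothetical helper name, and Plot does extra scaling and translating to turn values like these into pixels:

```javascript
// Equal Earth polynomial coefficients (assumed; verify against the published definition)
const A1 = 1.340264;
const A2 = -0.081106;
const A3 = 0.000893;
const A4 = 0.003796;
const M = Math.sqrt(3) / 2;

// Takes longitude and latitude in degrees; returns unitless [x, y] for a unit sphere
function equalEarth(lonDeg, latDeg) {
  const lambda = (lonDeg * Math.PI) / 180;
  const phi = (latDeg * Math.PI) / 180;
  const theta = Math.asin(M * Math.sin(phi)); // parametric latitude
  const t2 = theta * theta;
  const t6 = t2 * t2 * t2;
  const x =
    (lambda * Math.cos(theta)) /
    (M * (A1 + 3 * A2 * t2 + t6 * (7 * A3 + 9 * A4 * t2)));
  const y = theta * (A1 + A2 * t2 + t6 * (A3 + A4 * t2));
  return [x, y];
}

const eastEdge = equalEarth(180, 0);  // right-most point on the equator
const northPole = equalEarth(0, 90);  // top of the map on the central meridian
console.log(eastEdge, northPole);
```

The ratio of eastEdge[0] to northPole[1] is the map’s width-to-height ratio, which is why Plot can work out the height on its own once it has a width.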

The Observable Plot library includes a bunch of common built-in projections:

viewof projection = Inputs.select(
  ["equirectangular", "equal-earth", "mercator", "transverse-mercator", "azimuthal-equal-area", "gnomonic"],
  {value: "azimuthal-equal-area", label: "Projection"}
)

Plot.plot({
  projection: projection,
  // width,
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

Other projections

Observable Plot can support any other D3 projection too. There are a whole bunch of projections in the main d3-geo module, and there’s a separate d3-geo-projection module for dozens of others. My favorite global projection is Robinson (the foundation for Equal Earth), which lives in d3-geo-projection. To use it, we can import the module with require() and then access it with d3_geo_projection.geoRobinson():

d3_geo_projection = require("d3-geo-projection") 

Plot.plot({
  projection: d3_geo_projection.geoRobinson(), 
  width: 1000,
  height: 500, 
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

Filtering map data and adjusting projections

Removing elements

Now that we have a nice projection, we can tweak the map a little. Antarctica is taking up a big proportion of the southern hemisphere, so we’ll filter it out. The world object that has all the map data keeps each country object inside a features slot:

world.features

We can filter it using JavaScript’s .filter() function. To make sure that the result keeps the geographic-ness of the data and is a FeatureCollection, we need to create a similarly structured object, with type and features slots:

world_sans_penguins = ({ 
  type: "FeatureCollection", 
  features: world.features.filter(d => d.properties.iso_a3 !== "ATA") 
}) 

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 1000,
  height: 500,
  marks: [
    Plot.sphere({ fill: clr_ocean }),
    Plot.geo(world_sans_penguins, { 
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

That works and Antarctica is gone, as expected, but in reality the map didn’t actually change that much. Even if we stop using the sphere background and just fill the plot frame, we can see that the area where Antarctica was is still there, it’s just missing the land itself:

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 1000,
  height: 500,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }), 
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

Quick and dirty cheating method: change the width or height

One quick and dirty solution is to mess with the dimensions and shrink the height. After some trial and error, 430 pixels looks good:

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 1000,
  height: 430, 
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

While this works in this case, it’s not a universal solution. The only reason this works is because Antarctica happens to be at the bottom of the map. When you adjust the height of the plot area, the map itself is anchored to the top. Like, if we set the height to 215, we’ll get just the northern hemisphere:

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 1000,
  height: 215, 
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

As far as I can tell, there’s no way to anchor the map in any other position. If we filter the map data to only look at one continent, there’s no easy way to focus on just that continent by adjusting only the width or height options. Here’s Africa all by itself in a big empty plot area:

just_africa = ({ 
    type: "FeatureCollection", 
    features: world.features.filter(d => d.properties.continent == "Africa") 
}) 

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 1000,
  height: 430, 
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(just_africa, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

If we adjust the width or the height, the plot area will be resized with the map anchored in the top left corner so we’re left with just the northwestern part of Africa (and big empty areas where North America, South America, and Europe would be):

Plot.plot({
  projection: d3_geo_projection.geoRobinson(),
  width: 550, 
  height: 215, 
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(just_africa, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

That’s not great, but there are better ways!

Built-in projections and domain settings

The official Observable Plot method for fitting the plot window to a specific area of the map is to define a “domain” for one of the built-in projections to zoom in on specific areas. The documentation shows how to use special functions in d3-geo to create a circle around a point, but you can also pass a GeoJSON object and Plot will use its boundaries for the domain. The built-in projection options also let us control the outside margin of the domain with inset.

Here’s the world map without Antarctica, using the Equal Earth projection. The projection is resized to fit within the bounds of world_sans_penguins, with 10 pixels of padding around the landmass. Antarctica is gone now and the rest of the map is vertically centered within the plot area:

Plot.plot({
  projection: { 
    type: "equal-earth", 
    domain: world_sans_penguins, 
    inset: 10 
  }, 
  width: 1000,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

We can see what’s happening behind the scenes if we add Plot.sphere() back in. The rounded globe area is still there, but it’s shifted down and out of the frame. We’re essentially panning around and zooming in on the Equal Earth projection:

Plot.plot({
  projection: {
    type: "equal-earth",
    domain: world_sans_penguins, 
    inset: 10
  },
  width: 1000,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1 }), 
    Plot.sphere({ fill: clr_ocean }), 
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

Passing a GeoJSON object as the domain is really neat because it makes it straightforward to zoom in on specific areas. For instance, here’s the complete medium resolution world map zoomed in around the just_africa object, which keeps non-African countries in the Middle East and southern Europe:

Plot.plot({
  projection: {
    type: "equal-earth",
    domain: just_africa, 
    inset: 10
  },
  width: 600, 
  height: 600, 
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_medium, { 
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

We could also extract Africa from the medium resolution world map and plot only that continent, omitting the Middle East and Europe:

just_africa_medium = ({ 
    type: "FeatureCollection", 
    features: world_medium.features.filter(d => d.properties.continent == "Africa") 
}) 

Plot.plot({
  projection: {
    type: "equal-earth",
    domain: just_africa,
    inset: 10
  },
  width: 600,
  height: 600,
  marks: [
    // Use a white background since we don't want to make it look like  
    // the Sinai peninsula has a coastline  
    Plot.frame({ fill: "white", stroke: "black", strokeWidth: 1 }), 
    Plot.geo(just_africa_medium, {  
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})

Other projections and `.fitExtent()`

Unfortunately, it’s a little bit trickier to set the domain and inset for projections that aren’t built in to Plot. We can’t do this:

Plot.plot({
  projection: {
    type: d3_geo_projection.geoRobinson(),
    domain: world_sans_penguins,
    inset: 10
  },
  ...
})

Instead, we need to adjust the size of the projection window itself and build in the inset with d3-geo’s .fitExtent(). This function takes two arguments: an extent, which is an array of four numbers like [[x1, y1], [x2, y2]] that defines the top left and bottom right corners (in pixels) of a window, and a GeoJSON object that gets scaled and centered to fit inside that window. Here, for instance, we create a copy of the Robinson projection that fits just_africa in a window with a top left corner at (30, 30) and a bottom right corner at (570, 570):

inset_africa = 30
africa_map_width = 600
africa_map_height = 600

africa_robinson = d3_geo_projection.geoRobinson()
  .fitExtent(
    [[inset_africa, inset_africa],  // Top left
     [africa_map_width - inset_africa, africa_map_height - inset_africa]],  // Bottom right 
    just_africa
  )

Plot.plot({
  projection: africa_robinson,
  width: africa_map_width,
  height: africa_map_height,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})
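To get a feel for what .fitExtent() does with those corner coordinates, here's a rough sketch of the arithmetic in plain JavaScript. It assumes the shape has already been projected and reduced to a bounding box, and the name fit_box is made up. d3-geo's real implementation adjusts the projection's scale and translate settings, so treat this as an illustration rather than the library's code.

```javascript
// Hypothetical helper: find the zoom factor `k` and the offsets (`tx`, `ty`)
// that scale a bounding box to fit inside an extent and center it there.
// Both arguments look like [[x0, y0], [x1, y1]].
function fit_box(bounds, extent) {
  const [[bx0, by0], [bx1, by1]] = bounds;
  const [[ex0, ey0], [ex1, ey1]] = extent;
  const w = ex1 - ex0;
  const h = ey1 - ey0;

  // Use the smaller ratio so the box fits in both directions
  const k = Math.min(w / (bx1 - bx0), h / (by1 - by0));

  // Split the leftover space evenly so the box ends up centered
  const tx = ex0 + (w - k * (bx1 + bx0)) / 2;
  const ty = ey0 + (h - k * (by1 + by0)) / 2;
  return { k, tx, ty };
}

// A made-up 100 × 50 box placed in the same window as the Africa map
const fit = fit_box([[0, 0], [100, 50]], [[30, 30], [570, 570]]);

// Where the box's two corners end up, in pixels
const fitted_corners = [
  [fit.k * 0 + fit.tx, fit.k * 0 + fit.ty],
  [fit.k * 100 + fit.tx, fit.k * 50 + fit.ty]
];
console.log(fit, fitted_corners);
```

The wide toy box fills the window horizontally and gets centered vertically, which is the same thing that happens to a continent or country passed to .fitExtent().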

We can use the same approach with individual countries. For extra fun, we’ll fill these countries with distinct colors using Natural Earth’s mapcolor7 column, which assigns each country one of 7 different colors so that neighboring countries will never be the same color. We’ll also add some labels in the middle of each country.

egypt = world_medium.features.find(d => d.properties.name === "Egypt")

inset_egypt = 75
egypt_map_width = 600
egypt_map_height = 600

robinson_egypt = d3_geo_projection.geoRobinson()
  .fitExtent(
    [[inset_egypt, inset_egypt],  // Top left
     [egypt_map_width - inset_egypt, egypt_map_height - inset_egypt]],  // Bottom right
    egypt
  )

Plot.plot({
  projection: robinson_egypt,
  width: egypt_map_width,
  height: egypt_map_height,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_medium, Plot.centroid({
      fill: d => d.properties.mapcolor7,
      stroke: "black", 
      strokeWidth: 0.5
    })),
    Plot.geo(egypt, { stroke: "yellow", strokeWidth: 3 }),
    Plot.tip(world_medium.features, Plot.centroid({
      title: d => d.properties.name, 
      anchor: "top",
      fontSize: 13,
      fontWeight: "bold",
      textPadding: 3
    }))
  ],
  color: {
    range: carto_prism
  }
})

The approach works for the whole world_sans_penguins object as well. This addresses our original problem—here’s a world map with the Robinson projection without Antarctica that fills the plot area correctly:

inset_world = 10
world_map_width = 1000
world_map_height = 450

world_sans_penguins_robinson = d3_geo_projection.geoRobinson()
  .fitExtent(
    [[inset_world, inset_world],
     [world_map_width - inset_world, world_map_height - inset_world]],
    world_sans_penguins
  )

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5,
      fill: clr_land
    })
  ]
})
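The extent arithmetic is identical for the Africa, Egypt, and world maps, so it could be wrapped in a small helper. This is just a sketch in plain JavaScript, and inset_extent is a made-up name, not something from d3-geo or Plot:

```javascript
// Hypothetical helper: build the [[x1, y1], [x2, y2]] array that
// .fitExtent() expects from a plot size and a uniform inset (all in pixels)
function inset_extent(width, height, inset) {
  return [
    [inset, inset],                   // Top left
    [width - inset, height - inset]   // Bottom right
  ];
}

const africa_extent = inset_extent(600, 600, 30);
const egypt_extent = inset_extent(600, 600, 75);
const world_extent = inset_extent(1000, 450, 10);
console.log(africa_extent, egypt_extent, world_extent);
```

In a notebook, the result could then be passed straight along, like d3_geo_projection.geoRobinson().fitExtent(inset_extent(1000, 450, 10), world_sans_penguins).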

Arbitrary areas and `.fitExtent()`

For bonus fun, this approach also works for any arbitrary rectangles. For example, we can use OpenStreetMap’s neat Export tool to pick the top, bottom, left, and right edges of a box that focuses on Western Europe.

Rectangle around western Europe with OpenStreetMap’s Export tool

We can then use those coordinates to create a MultiPoint geometric feature/object, which essentially acts like a rectangular fake country/region that can be used as the domain or extent of the map:

inset_europe = 10
europe_map_width = 800
europe_map_height = 800

europe_box = ({
  type: "Feature",
  geometry: {
    type: "MultiPoint",
    coordinates: [
      [-13, 35],  // [left/west, bottom/south] (or bottom left corner)
      [21, 60]    // [right/east, top/north]   (or top right corner)
    ]
  }
})

europe_robinson = d3_geo_projection.geoRobinson()
  .fitExtent(
    [[inset_europe, inset_europe],
     [europe_map_width - inset_europe, europe_map_height - inset_europe]], 
    europe_box
  )

Plot.plot({
  projection: europe_robinson,
  width: europe_map_width,
  height: europe_map_height,
  marks: [
    Plot.frame({ fill: clr_ocean, stroke: "black", strokeWidth: 1 }),
    Plot.geo(world_medium, Plot.centroid({
      fill: d => d.properties.mapcolor9,
      stroke: "white", 
      strokeWidth: 0.25
    })),
    Plot.tip(world.features, Plot.centroid({
      title: d => d.properties.name, 
      anchor: "bottom",
      fontSize: 13,
      fontWeight: "bold",
      textPadding: 3
    }))
  ],
  color: {
    range: carto_prism
  }
})
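If you're drawing lots of these boxes, a tiny helper can keep the corner order straight. Here's a sketch in plain JavaScript; bbox_feature is a hypothetical name, and the empty properties slot is only there to make the object a complete GeoJSON Feature. The one thing to watch is that GeoJSON coordinates go [longitude, latitude], which is the reverse of how latitude and longitude are often written.

```javascript
// Hypothetical helper: turn the four edges of a box (in degrees) into a
// MultiPoint feature containing the bottom left and top right corners
function bbox_feature({ west, south, east, north }) {
  return {
    type: "Feature",
    properties: {},
    geometry: {
      type: "MultiPoint",
      coordinates: [
        [west, south],  // [longitude, latitude] of the bottom left corner
        [east, north]   // [longitude, latitude] of the top right corner
      ]
    }
  };
}

// The same western Europe box as above
const toy_europe_box = bbox_feature({ west: -13, south: 35, east: 21, north: 60 });
console.log(JSON.stringify(toy_europe_box.geometry.coordinates));
```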

Working with USAID data

Get USAID data

To make it easier to access and filter and manipulate things, I put the rescued data on a Datasette instance, which is a nice front-end for an SQLite database. This makes it possible to run SQL queries directly in the browser and generate custom datasets without needing to load the full massive CSV files into R or Python or Stata or whatever.

For example, one of the rescued USAID datasets is named us_foreign_aid_country and it contains 22,000+ rows, with data on aid obligations, appropriations, and disbursements starting in 1999.

If we want to get a total of all constant USD aid obligations by country in 2023, omitting regional and world totals, we could do something like this with R and {dplyr}:

library(tidyverse)

# Download the raw CSV and put it somewhere
us_foreign_aid_country <- read_csv("us_foreign_aid_country.csv")

us_foreign_aid_country |>
  filter(
    `Fiscal Year` == 2023, 
    `Transaction Type Name` == "Obligations",
    !str_detect(`Country Name`, "Region"),
    `Country Name` != "World"
  ) |>
  group_by(`Country Code`, `Country Name`, `Region Name`) |>
  summarize(total_constant_amount = sum(constant_amount)) |>
  arrange(desc(total_constant_amount))
#> # A tibble: 176 × 4
#> # Groups:   Country Code, Country Name [176]
#>    `Country Code` `Country Name`   `Region Name`                total_constant_amount
#>    <chr>          <chr>            <chr>                                        <dbl>
#>  1 UKR            Ukraine          Europe and Eurasia                     17193710403
#>  2 ISR            Israel           Middle East and North Africa            3302860882
#>  3 JOR            Jordan           Middle East and North Africa            1686862605
#>  4 EGY            Egypt            Middle East and North Africa            1503609426
#>  5 ETH            Ethiopia         Sub-Saharan Africa                      1457374911
#>  6 SOM            Somalia          Sub-Saharan Africa                      1181033990
#>  7 NGA            Nigeria          Sub-Saharan Africa                      1019947490
#>  8 COD            Congo (Kinshasa) Sub-Saharan Africa                       990456757
#>  9 AFG            Afghanistan      South and Central Asia                   886536741
#> 10 KEN            Kenya            Sub-Saharan Africa                       846303488
#> # ℹ 166 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Or we could get that data extract directly from the database without needing to load the huge original CSV file. We can run an SQL query like this at the Datasette website:

SELECT "Country Code", "Country Name", "Region Name", SUM("constant_amount") AS total_constant_amount
  FROM "./us_foreign_aid_country"
  WHERE 
    "Fiscal Year" = '2023' 
    AND "Transaction Type Name" = 'Obligations' 
    AND "Country Name" NOT LIKE '%Region%' 
    AND "Country Name" != 'World'
  GROUP BY "Country Code", "Country Name", "Region Name"
  ORDER BY total_constant_amount DESC;

SQL query and results

Since we’re working with interactive Observable JavaScript, we can load that data directly into the browser instead of downloading intermediate CSV files. There’s a neat Datasette database client for Observable that lets us run SQL queries (there are lots of other clients too, if you want to connect to things like DuckDB, SQLite, MySQL, Snowflake, and so on).

import { DatasetteClient } from "@ambassadors/datasette-client"

aid_db = new DatasetteClient(
  "https://foreignassistance-data.andrewheiss.com/2025-02-03_foreign-assistance"
)

recipient_countries = await aid_db.sql`
  SELECT "Country Code", "Country Name", "Region Name", SUM("constant_amount") AS total_constant_amount
  FROM "./us_foreign_aid_country"
  WHERE 
    "Fiscal Year" = '2023' 
    AND "Transaction Type Name" = 'Obligations' 
    AND "Country Name" NOT LIKE '%Region%' 
    AND "Country Name" != 'World'
  GROUP BY "Country Code", "Country Name", "Region Name"
  ORDER BY total_constant_amount DESC;
`

Through the magic of this Datasette client, we now have a pre-summarized dataset to work with!

// I don't want to keep hitting the Datasette server with requests, so I'm 
// cheating and loading a CSV extract instead. It comes from this query: 
// https://foreignassistance-data.andrewheiss.com/2025-02-03_foreign-assistance.csv?sql=SELECT+%22Country+Code%22%2C+%22Country+Name%22%2C+%22Region+Name%22%2C+SUM%28%22constant_amount%22%29+AS+total_constant_amount%0D%0A++FROM+%22.%2Fus_foreign_aid_country%22%0D%0A++WHERE+%22Fiscal+Year%22+%3D+%272023%27+%0D%0A++++AND+%22Transaction+Type+Name%22+%3D+%27Obligations%27+%0D%0A++++AND+%22Country+Name%22+NOT+LIKE+%27%25Region%25%27+%0D%0A++++AND+%22Country+Name%22+%21%3D+%22World%22%0D%0A++GROUP+BY+%22Country+Code%22%2C+%22Country+Name%22%2C+%22Region+Name%22%0D%0A++ORDER+BY+total_constant_amount+DESC%3B&_size=max
recipient_countries = await FileAttachment("recipient_countries.csv").csv({ typed: true })

recipient_countries
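That long URL in the comment above is just the database address, a .csv extension, and the SQL query squeezed into a query string. Here's a sketch of how a URL like that could be put together in plain JavaScript. The helper name datasette_csv_url and the shortened toy query are made up; the sql and _size=max parameters are copied from the shape of the URL above.

```javascript
// Hypothetical helper: build a CSV export URL for a Datasette database.
// `_size=max` mirrors the parameter at the end of the URL above.
function datasette_csv_url(database_url, sql) {
  // URLSearchParams takes care of escaping quotes, spaces, and line breaks
  const params = new URLSearchParams({ sql: sql, _size: "max" });
  return `${database_url}.csv?${params}`;
}

// A shorter toy query, just for illustration
const toy_sql = `SELECT "Country Code", SUM("constant_amount") AS total
  FROM "./us_foreign_aid_country"
  WHERE "Fiscal Year" = '2023'
  GROUP BY "Country Code"`;

const toy_url = datasette_csv_url(
  "https://foreignassistance-data.andrewheiss.com/2025-02-03_foreign-assistance",
  toy_sql
);
console.log(toy_url);
```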

Connect USAID data to the map data

Following Observable Plot’s choropleth tutorial, to show these totals on a map, we need to create a Map object,¹ which is like a Python dictionary or an R data frame with two columns: (1) a key that matches something in the geographic data, like an ISO3 country code, and (2) a value with the thing we want to plot.

¹ This term is admittedly confusing because it has nothing to do with geographic maps. It’s a key-value data structure that maps each key to a value.

country_totals = new Map(recipient_countries.map(d => [d["Country Code"], d.total_constant_amount]))
country_totals

This lets us get specific totals with the .get() method. Here’s Ukraine, for example:

country_totals.get("UKR")
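Here's a tiny standalone sketch of the same idea in plain JavaScript, using two rows copied from the summarized results above (the toy_ names are made up). The part worth noticing is what happens with a country code that isn't in the data: .get() returns undefined, which is why countries without aid end up with no fill value on the map.

```javascript
// Two rows copied from the summarized results above
const toy_rows = [
  { "Country Code": "UKR", total_constant_amount: 17193710403 },
  { "Country Code": "ISR", total_constant_amount: 3302860882 }
];

// Each row becomes a [key, value] pair
const toy_totals = new Map(
  toy_rows.map(d => [d["Country Code"], d.total_constant_amount])
);

console.log(toy_totals.get("UKR"));  // A country that is in the data
console.log(toy_totals.has("USA"));  // Membership check for one that is not
console.log(toy_totals.get("USA"));  // Missing keys come back as undefined
```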

We can feed the ISO3 code of each country-level geographic shape into this country_totals object to extract the total amount of aid for each country. We’ll use the Antarctica-free Robinson projection we made earlier, and we’ll remove the ocean fill since we’ll ultimately make this interactive and hoverable:

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3)
    }))
  ]
})

Improving the map

We have a choropleth! But this is hardly publication worthy. We need to fix a bunch of issues with it.

First, countries that don’t receive aid don’t appear in the map. Let’s add borders to all the countries:

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3)
    })),
    Plot.geo(world_sans_penguins, {  
      stroke: "black",  
      strokeWidth: 0.5  
    })  
  ]
})

The coloring here is gross because of some huge outliers (Ukraine) that make most countries black/dark blue. There’s also no legend to show what these values are. We can address all of this by adjusting the color scale options. We’ll put total aid on a log scale, include the legend, add a nice label, and use a single-hue coloring scheme with gray for countries without aid:

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3)
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5
    })
  ],
  color: {  
    scheme: "blues",  
    unknown: "#f2f2f2",  
    type: "log",   
    legend: true,  
    label: "Total obligations",  
  }  
})
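To see why the log scale helps so much, here's a quick back-of-the-envelope sketch in plain JavaScript. It uses three totals from the summarized results above and pretends, purely for illustration, that the color scale only runs from Kenya to Ukraine (the real scale covers every country). The function names are made up.

```javascript
// Totals copied from the top of the summarized results above
const ukraine = 17193710403;
const israel = 3302860882;
const kenya = 846303488;

// Position of `value` between `min` and `max` on a linear scale (0 to 1)
function linear_position(value, min, max) {
  return (value - min) / (max - min);
}

// Same idea, but measured in logs
function log_position(value, min, max) {
  return (Math.log(value) - Math.log(min)) / (Math.log(max) - Math.log(min));
}

console.log(linear_position(israel, kenya, ukraine));
console.log(log_position(israel, kenya, ukraine));
```

On the linear scale Israel lands at about 0.15, squished down near Kenya, while on the log scale it lands at about 0.45, close to the middle of the palette. That's the difference between a map where almost everything is one color and a map with visible variation.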

Next, let’s make this interactive by turning on hovering tooltips:

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3),
      tip: true 
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5
    })
  ],
  color: {
    scheme: "blues",
    unknown: "#f2f2f2",
    type: "log", 
    legend: true,
    label: "Total obligations",
  }
})

That’s so cool. Hover over Mexico and you’ll see “Total obligations 232,214,023”.

We can make this tooltip more informative by including the country name and formatting the amount to show dollars. Instead of using tip: true, we can add the country name as a channel (Observable Plot’s version of a ggplot aesthetic), and format the tip so that the country name comes first and the total amount is formatted with d3.format():

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3),
      channels: { 
        Country: d => d.properties.name, 
      }, 
      tip: { 
        format: { 
          Country: true, 
          fill: d3.format("$,d") 
        } 
      } 
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5
    })
  ],
  color: {
    scheme: "blues",
    unknown: "#f2f2f2",
    type: "log", 
    legend: true,
    label: "Total obligations",
  }
})

Now hover over Mexico and you’ll see the country name and the amount of aid in dollars.

Fixing labelling issues

We have two final super minor issues to address.

First, hover over a country that didn’t receive aid, like the United States or Australia. The total reported aid displays as “$NaN”. That’s gross. It’d be nicer if it said something else, like “$0” or “No aid” or something more informative.

To fix this, we can make a little function that formats the given value as a dollar amount if it’s an actual value, and formats it as something else if it’s missing or not a number (like log(0)):

function format_aid_total(value) {
  return value ? d3.format("$,d")(value) : "No aid";
}

That works nicely:

format_aid_total(394023)

format_aid_total(NaN)
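The value ? ... : ... check relies on JavaScript truthiness, so it catches more than just NaN. Here's a standalone sketch that can run outside of Observable. Since d3 isn't available there, it swaps in the built-in Intl.NumberFormat as a stand-in for d3.format("$,d"), and the toy_ names are made up:

```javascript
// Stand-in for d3.format("$,d"): whole dollars with thousands separators
const toy_dollars = new Intl.NumberFormat("en-US", {
  style: "currency",
  currency: "USD",
  minimumFractionDigits: 0,
  maximumFractionDigits: 0
});

function toy_format_aid_total(value) {
  // undefined, NaN, null, and 0 are all falsy, so they all become "No aid"
  return value ? toy_dollars.format(value) : "No aid";
}

console.log(toy_format_aid_total(394023));
console.log(toy_format_aid_total(NaN));
console.log(toy_format_aid_total(undefined));
console.log(toy_format_aid_total(0));
```

One side effect of this approach is that a true total of 0 would also show up as “No aid”, which seems reasonable for this map but is worth knowing about.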

The other problem is in the legend, which uses a logarithmic scale and includes breaks for 10k, 1M, 100M, and 10G, representing $10,000, $1 million, $100 million, and $10 billion in aid.

The issue is the $10 billion, which is abbreviated with “G”.

This is happening because d3.format() uses SI (Système international d’unités, or International System of Units) values for its numeric formats, which means that it uses SI metric prefixes. Those legend breaks, therefore, actually technically mean this:

10k: 10 kilodollars
1M: 1 megadollar
100M: 100 megadollars
10G: 10 gigadollars

lol, I should start talking about big dollar amounts with these values (“the 2022 US federal budget deficit was 1.4 teradollars”)

The first letters of many of these SI prefixes happen to line up with US-style large numbers:

In the US we already commonly use “k” for thousand
The initial “m” in “mega” aligns with “million”
The initial “t” in tera aligns with “trillion”

But “giga” doesn’t align with “billion”, hence the strange “G” here for dollar amounts.

People have requested that d3-format include an option for switching the abbreviation from G to B, but the developers haven’t added it (and probably won’t). Instead, a common recommended fix is to replace all “G”s with “B”s:

number_in_billions = 13840918291  // A big number I randomly typed

// Billions of dollars instead of SI-style gigadollars
d3.format("$.4s")(number_in_billions).replace("G", "B")
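To make the SI logic a little less mysterious, here's a simplified imitation of it in plain JavaScript. This is not how d3-format is actually implemented, and toy_si_dollars is a made-up name. It only handles non-negative numbers from 0 up through the trillions, but it shows where the “G” comes from and what the .replace() trick does to the legend breaks:

```javascript
const SI_PREFIXES = ["", "k", "M", "G", "T"];

// Hypothetical, simplified imitation of d3.format("$.2s")
function toy_si_dollars(value, digits = 2) {
  let scaled = value;
  let tier = 0;

  // Knock off three zeros at a time and move up one prefix each time
  while (scaled >= 1000 && tier < SI_PREFIXES.length - 1) {
    scaled /= 1000;
    tier += 1;
  }

  // toPrecision() keeps `digits` significant digits; Number() tidies up
  // results like "1.0e+2" or "1.0"
  return "$" + Number(scaled.toPrecision(digits)) + SI_PREFIXES[tier];
}

// The four breaks from the legend
const legend_breaks = [1e4, 1e6, 1e8, 1e10];
console.log(legend_breaks.map(d => toy_si_dollars(d)));
console.log(legend_breaks.map(d => toy_si_dollars(d).replace("G", "B")));
```

Since “G” is the only prefix that clashes with US-style abbreviations, a single .replace() is enough, and the k and M labels pass through unchanged.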

We can add format_aid_total() and the .replace("G", "B") tweak and fix the labels in our interactive map:

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3),
      channels: {
        Country: d => d.properties.name,
      },
      tip: {
        format: {
          Country: true,
          fill: d => format_aid_total(d) 
        }
      }
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.5
    })
  ],
  color: {
    scheme: "blues",
    unknown: "#f2f2f2",
    type: "log", 
    legend: true,
    label: "Total obligations",
    tickFormat: d => d3.format("$0.2s")(d).replace("G", "B") 
  }
})

Some final tweaks

We’re so close! Just a couple final incredibly minor changes:

We’ll boost the font size of the tooltip a little and increase the font size of the legend
We’ll switch from the built-in ColorBrewer blues palette to show how to use custom gradients, like CARTOColors’s PurpOr sequential palette

Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3),
      channels: {
        Country: d => d.properties.name,
      },
      tip: {
        fontSize: 12, 
        format: {
          Country: true,
          fill: d => format_aid_total(d)
        }
      }
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.15
    })
  ],
  color: {
    // scheme: "blues", 
    range: ["#f9ddda", "#f2b9c4", "#e597b9", "#ce78b3", "#ad5fad", "#834ba0", "#573b88"], 
    unknown: "#f2f2f2",
    type: "log", 
    legend: true,
    label: "Total obligations",
    tickFormat: d => d3.format("$0.2s")(d).replace("G", "B"),
    style: { 
      "font-size": "14px" 
    } 
  }
})

The full game: Complete final code

That final interactive map looks great! We could be even fancier with it by adding dropdowns for dynamically grabbing data for different years or different types of amounts (appropriations, allocations, etc.), or even filter by specific regions or countries. But we won’t.

The different colors and data sources we’ve used are scattered throughout this post. To simplify things, here’s the complete code all in one location. (This chunk doesn’t actually run, since Observable gets mad if you create a new variable with the same name as one that already exists.)

d3_geo = require("d3-geo")
d3_geo_projection = require("d3-geo-projection")

// ----------------------------------------------------------------------
// Map stuff
// ----------------------------------------------------------------------
world = FileAttachment("ne_110m_admin_0_countries.geojson").json()

// Antarctica's ISO3 code is ATA
world_sans_penguins = ({
  type: "FeatureCollection",
  features: world.features.filter(d => d.properties.iso_a3 !== "ATA")
})

inset_world = 10
world_map_width = 1000
world_map_height = 450

world_sans_penguins_robinson = d3_geo_projection.geoRobinson()
  .fitExtent(
    [[inset_world, inset_world],
     [world_map_width - inset_world, world_map_height - inset_world]],
    world_sans_penguins
  )

// ----------------------------------------------------------------------
// Data stuff
// ----------------------------------------------------------------------
import { DatasetteClient } from "@ambassadors/datasette-client"

aid_db = new DatasetteClient(
  "https://foreignassistance-data.andrewheiss.com/2025-02-03_foreign-assistance"
)

recipient_countries = await aid_db.sql`
  SELECT "Country Code", "Country Name", "Region Name", SUM("constant_amount") AS total_constant_amount
  FROM "./us_foreign_aid_country"
  WHERE 
    "Fiscal Year" = '2023' 
    AND "Transaction Type Name" = 'Obligations' 
    AND "Country Name" NOT LIKE '%Region%' 
    AND "Country Name" != 'World'
  GROUP BY "Country Code", "Country Name", "Region Name"
  ORDER BY total_constant_amount DESC;
`

country_totals = new Map(recipient_countries.map(d => [d["Country Code"], d.total_constant_amount]))

function format_aid_total(value) {
  return value ? d3.format("$,d")(value) : "No aid";
}

// ----------------------------------------------------------------------
// Plot stuff
// ----------------------------------------------------------------------
Plot.plot({
  projection: world_sans_penguins_robinson,
  width: world_map_width,
  height: world_map_height,
  marks: [
    Plot.frame({ stroke: "black", strokeWidth: 1} ),
    Plot.geo(world_sans_penguins, Plot.centroid({
      fill: d => country_totals.get(d.properties.iso_a3),
      channels: {
        Country: d => d.properties.name,
      },
      tip: {
        fontSize: 12,
        format: {
          Country: true,
          fill: d => format_aid_total(d)
        }
      }
    })),
    Plot.geo(world_sans_penguins, {
      stroke: "black",
      strokeWidth: 0.15
    })
  ],
  color: {
    range: ["#f9ddda", "#f2b9c4", "#e597b9", "#ce78b3", "#ad5fad", "#834ba0", "#573b88"],
    unknown: "#f2f2f2",
    type: "log", 
    legend: true,
    label: "Total obligations",
    tickFormat: d => d3.format("$0.2s")(d).replace("G", "B"),
    style: {
      "font-size": "14px" 
    }
  }
})

Citation

BibTeX citation:

@online{heiss2025,
  author = {Heiss, Andrew},
  title = {Using {USAID} Data to Make Fancy World Maps with {Observable}
    {Plot}},
  date = {2025-02-10},
  url = {https://www.andrewheiss.com/blog/2025/02/10/usaid-ojs-maps/},
  doi = {10.59350/c0aep-hp989},
  langid = {en}
}

For attribution, please cite this work as:

Heiss, Andrew. 2025. “Using USAID Data to Make Fancy World Maps with Observable Plot.” February 10, 2025. https://doi.org/10.59350/c0aep-hp989.

Guide to comparing sample and population proportions with CPS data, both classically and Bayesianly

Andrew Heiss — Mon, 27 Jan 2025 05:00:00 GMT

Last week I was making some final revisions to a paper where we used a neat conjoint experiment to test the effect of a bunch of different treatments on nonprofit donor preference.

One of the peer reviewers asked us to compare the characteristics of our experimental sample with the general population so that we could speak a little to the experiment’s generalizability. This is a super common thing to do with survey research, and one of the main reasons survey researchers include demographic questions in their surveys.

Thanks to the wonders of the R community—and thanks to publicly accessible data—I was able to grab nationally representative demographic data, clean it up and summarize it, run some statistical tests, and make a table to meet the reviewer’s request, all in like 45 minutes.

It was a magically quick and easy process, so I figured I’d make a guide about it so that the rest of the world (but mostly future me) can see how to do it.

Nationally representative demographic data

Finding nationally representative demographic data (in the US, at least) is pretty easy, and there are two common sources for it:

The US Census’s American Community Survey (ACS) is a rolling monthly survey of ≈3.5 million (!!!) US households that’s compiled into an annual dataset.
The US Census’s Current Population Survey (CPS) is a monthly survey of ≈100,000 US individuals. A more comprehensive annual version—the Annual Social and Economic Supplement (ASEC)—is published every March.

The two surveys serve different purposes, and the Census has an FAQ fact sheet explaining the difference between the ACS and CPS. Notably, the ACS only surveys households, and it uses a shorter 8-question survey, while the CPS tries to reach the entire civilian noninstitutionalized population and uses a longer, more detailed survey.

Researchers use both surveys, and I’ve used both in my own work. According to the Census, due to its detailed questionnaire and staff experience and regular frequency, the CPS ASEC is a “high quality source of information used to produce the official annual estimate of poverty, and estimates of a number of other socioeconomic and demographic characteristics”. The ACS also has demographic details, and people use it for this kind of comparison too.

I’m not entirely sure which one is best—to me they’re both great 🤷‍♂️. Smarter people than me know and care about the difference.

Accessing US Census data

Getting data from the Census is a surprisingly complex process! There are websites and R packages that make it easier though.

ACS

For the ACS, the {tidycensus} R package provides an interface to the Census’s API, and its documentation is great and thorough. Working with the results is tricky though, and involves a lot of pivoting and reshaping and combining variables. I have a whole notebook showing how I access the ACS and create a bunch of variables, with little notes reminding myself how I constructed everything:

Explanation of how I calculated the proportion of households in a block group with a high school education

CPS (and others!)

{tidycensus} doesn’t provide Census API access to CPS data. Instead, IPUMS (a project housed at the University of Minnesota and supported by a consortium of other institutions and companies) provides easy access to all sorts of census and survey data, both through its website and through an API. It’s wild how much data they have. In addition to the CPS, they have the ACS, census microdata for 100+ countries, historical GIS shapefiles, time use surveys, and a ton of other things. It’s an incredible project.

Getting started

Who this guide is for

Here’s what I assume you know:

You’re familiar with R and the tidyverse (particularly {dplyr} and {ggplot2}).
You’re familiar with {brms} for running Bayesian regression models and {tidybayes} and {ggdist} for manipulating and plotting posterior draws.

In this guide, we’ll use IPUMS to get CPS data (monthly and ASEC) with R. Because their data explorer website takes a little while to get used to, I’ll show a bunch of step-by-step screenshots of how to navigate it. It’s possible to access the IPUMS API with the {ipumsr} package, and I also show how to do that in this guide. But in order to use the API, you still need to know how to use the website—you have to find variable names and figure out which samples include which variables. So the screenshots below are still important even if you’re using the API.

I’ll then show how to answer the question of whether a survey proportion is equivalent to a population proportion in a couple of different ways:

Frequentist/classical proportion tests with null hypothesis significance testing, and
Bayesian proportion tests and inference based on regions of practical equivalence, or ROPEs

Before getting started, let’s load all the packages we need and create some helpful functions and variables.

The experimental survey data here comes from Chaudhry, Dotson, and Heiss (2024). Since we haven’t published it yet (though it’s close—it’s under review post R&R now!), the original data isn’t quite public yet. So I used the {synthpop} R package to create a synthetic version of part of our data that has the same relationships and distributions as the real results, but is all fake. You can see the R code I used for that process here.

If you want to follow along, you can download this synthetic data here:

synthetic_data.rds: RDS version with all the variable attributes (i.e. factor levels and ordering) included
synthetic_data.csv: Plain-text version (but without variable attributes)

To reflect the fact that this is all public, government-created data, I’m using the Public Sans font, an open source font developed as part of the General Services Administration’s USWDS (US Web Design System) for making accessible federal government websites. I’m also using the USWDS’s basic color palette, developed by 18F.

Let’s get started!

library(tidyverse)   # {ggplot2}, {dplyr}, and friends
library(tinytable)   # Nice tables
library(brms)        # Best way to run Stan models
library(tidybayes)   # Manipulate Stan objects and draws
library(broom)       # Convert model objects to data frames
library(glue)        # Easier string construction
library(scales)      # Nicer labels
library(ggdist)      # Plot posterior distributions
library(ggforce)     # Extra ggplot things like facet_col()
library(patchwork)   # Combine ggplot plots

# Load the synthetic survey results
results <- readRDS("synthetic_data.rds")

# Use the cmdstanr backend for brms because it's faster and more modern than the
# default rstan backend. You need to install the cmdstanr package first
# (https://mc-stan.org/cmdstanr/) and then run cmdstanr::install_cmdstan() to
# install cmdstan on your computer.
options(
  mc.cores = 4,
  brms.backend = "cmdstanr"
)

# Set some global Stan options
CHAINS <- 4
ITER <- 2000
WARMUP <- 1000
BAYES_SEED <- 1234

# Nice ggplot theme
theme_public <- function() {
  theme_minimal(base_family = "Public Sans") +
    theme(
      panel.grid.minor = element_blank(),
      plot.title = element_text(family = "Public Sans", face = "bold", size = rel(1.25)),
      plot.subtitle = element_text(family = "Public Sans Light", face = "plain"),
      plot.caption = element_text(family = "Public Sans Light", face = "plain"),
      axis.title = element_text(family = "Public Sans Semibold", size = rel(0.8)),
      axis.title.x = element_text(hjust = 0),
      axis.title.y = element_text(hjust = 1),
      strip.text = element_text(
        family = "Public Sans Semibold", face = "plain",
        size = rel(0.8), hjust = 0
      ),
      strip.background = element_rect(fill = "grey90", color = NA),
      legend.title = element_text(family = "Public Sans Semibold", size = rel(0.8)),
      legend.text = element_text(size = rel(0.8)),
      legend.position = "bottom",
      legend.justification = "left",
      legend.title.position = "top",
      legend.margin = margin(l = 0, t = 0)
    )
}

theme_set(theme_public())
update_geom_defaults("text", list(family = "Public Sans"))
update_geom_defaults("label", list(family = "Public Sans"))

# USWDS basic palette
# https://designsystem.digital.gov/utilities/color/#basic-palette-2
clrs <- c(
  "#e52207", # .bg-red
  "#e66f0e", # .bg-orange
  "#ffbe2e", # .bg-gold
  "#fee685", # .bg-yellow
  "#538200", # .bg-green
  "#04c585", # .bg-mint
  "#009ec1", # .bg-cyan
  "#0076d6", # .bg-blue
  "#676cc8", # .bg-indigo
  "#8168b3", # .bg-violet
  "#d72d79" # .bg-magenta
)

# Some functions for creating percentage point labels
label_pp <- label_number(
  accuracy = 1, scale = 100, suffix = " pp.", style_negative = "minus"
)

label_pp_01 <- label_number(
  accuracy = 0.1, scale = 100, suffix = " pp.", style_negative = "minus"
)

Getting CPS data from the IPUMS website

Go to the IPUMS CPS website and create an account if you don’t already have one.

Once you’re logged in, go to the “Select Data” page, where IPUMS tells you to do two things:

Select samples, or specific versions of different surveys
Select variables, or specific columns in different surveys

Initial data extract page

Visually it looks like you should select samples first, but I actually find it easier to poke around for different variables first, since not all variables are recorded in every sample.

Finding variables

So first let’s look at a few variables to get a feel for the IPUMS data extract website. Click on the little “SEARCH 🔍” button in the “SELECT VARIABLES” section. We could do a bunch of fancy advanced search options, but for now, just search for “age”:

Searching for age-related variables

There are 200+(!) age-related variables in the CPS data:

Search results related to age

This big list shows some useful information already. The first variable in the search results is AGE, and it’s the one we care about. It gets recorded in every CPS survey: each monthly one and in the annual ASEC one. Not all the age variables do this—notice that WHYSS1 only appears in the annual ASEC.

If you click on AGE in the “Variable” column, you can see detailed information about the variable, like how it’s coded, a description, and its availability. For example, prior to 1976, age was only available in the annual ASEC; starting in January 1976, it became a monthly thing.

AGE availability

Since we know we want this variable, we can add it to our “Data Cart”. IPUMS uses a shopping metaphor for building a data extract: we can add different variables and samples to a cart and then check it out (for free) once we’ve found everything we’re looking for.

Click on “Add to cart” to add it, then go back to the search page to look for more variables. You can also add variables to the cart without going to the variable details page—there’s a plus sign in the search results page next to each result that will add the variable for you.

Add AGE to cart

Now that we have age, we need to hunt around for other variables we care about, like sex, marital status, voting history, and so on. To speed things up, you can search for their official variable names and add each one to the cart:

Sex = SEX
Marital status = MARST
Education = EDUC
Donating = VLDONATE
Volunteering = VLSTATUS
Volunteering Supplement weight = VLSUPPWT
Voting = VOTED
Voter Supplement weight = VOSUPPWT

Pay attention to the details!

Looking at the details for these variables is helpful since they’re all categorical variables, unlike age. For instance, marital status has 9 different levels:

Nine different marital status (MARST) codes

Also, it’s important to check the variable details to check for availability. While basic demographic variables like age, sex, marital status, etc. are available in both the monthly surveys and in the annual ASEC, more specialized variables are not.

Variables related to philanthropy and volunteering are only available in September (since they’re part of a special CPS Volunteer Supplement), and only in some years:

Volunteer status (VLSTATUS) only available in September, and only in some years

Variables related to voting are only available in November in even-numbered years (since they’re part of the CPS’s Voting and Registration Supplement):

Voting status (VOTED) only available in even-numbered years in November

Selecting samples

Great! If you check the cart, you’ll see all the variables we added at the bottom, along with a bunch of other pre-selected columns:

All variables added to the cart, but no samples are selected

We can’t download any data yet, though. We’ve selected the variables—now we need to select the samples. Go back to the “Select data” page and click on “Select samples” (or click on the “Add more samples” button at the top of the data cart page).

By default, IPUMS will have a bunch of different samples pre-checked. In my case, it grabbed all annual ASEC surveys from 2010–2024, and all monthly surveys from 2021–2024.

Pre-selected ASEC samples
Pre-selected monthly samples

Including all these samples would be useful if we were doing some sort of analysis of CPS trends over time, comparing changes in age or education or volunteering or whatever. But that doesn’t matter here—all we want to know is what age (and everything else) looked like at the time the survey was administered. That means we really just need one year.

However, we can’t just choose one sample. Things like demographics are available in all annual and monthly samples, but volunteering is only available in September in specific years, and voting is only available in November in specific years.

This survey was administered in mid-2019, so we’ll choose samples that are as close to that as possible. Though demographics are available both monthly and annually, I like to use the annual versions because ASEC data is typically used to stand in for annual information—like if you were building a state-year panel dataset, you’d use ASEC data for each year. The ASEC occurs in March and actually overlaps with the monthly March data (IPUMS has a note about that), so the data technically is for March 2019, but whatever. We don’t have to be super precise here.

We’ll use the September 2019 sample for volunteering, even though that’s after the survey was administered. The next earliest volunteering data is the September 2017 sample, which is like 2 years before the survey. Things don’t line up precisely, but again, that’s fine.

Finally, we’ll use the November 2018 sample for voting. That’s before the survey, but it’s the closest we can get—the next alternative is November 2020, which is a year after the survey. Once again, nothing lines up exactly, but it’s fine.

In summary, here are the variables we want and the samples we’ll get them from:

Age (AGE): 2019 ASEC
Sex (SEX): 2019 ASEC
Marital status (MARST): 2019 ASEC
Education (EDUC): 2019 ASEC
Donating (VLDONATE): 2019-09 Monthly
Volunteering (VLSTATUS): 2019-09 Monthly
Volunteering Supplement weight (VLSUPPWT): 2019-09 Monthly
Voting (VOTED): 2018-11 Monthly
Voter Supplement weight (VOSUPPWT): 2018-11 Monthly

Select those three samples (2019 ASEC, September 2019, and November 2018) and click on “Submit sample selections” to add them to the cart.

Specific ASEC sample
Specific monthly samples

March 2019 ASEC sample

September 2019 and November 2018 monthly samples

The cart should now have 9 variables and 3 samples. Conveniently, it has a little summary table showing which samples have which variables, where we can confirm that age, sex, marital status, and education are in all three, volunteering and donating are only in September 2019, and voting is only in November 2018.

All variables and samples selected and ready to go

Downloading the data

Now that we have all the variables and samples we care about in the data cart, we can create a data extract and download this stuff.

Click on the “Create data extract” button at the top of the data cart page, which will take you to the official Extract Request page. There are a bunch of extra options here, and you can optionally add a description to the extract, but we’ll ignore all those. Click on the “Submit extract” button and wait for the IPUMS server to compile it all.

Extract submission page

Once it’s ready, it’ll appear at your “My Data” page, which will have a list of all your past extracts.

To download the data, we actually need to download two things:

The data itself. Click on the big green “Download .DAT” button.

Button for downloading a .dat version of the data

.dat vs .dat.gz

Depending on your browser, the downloaded file will either end in .dat or .dat.gz. If it ends in .gz, it’ll be compressed and zipped (the compressed version of this extract is 7.6 MB); if it ends in .dat, it’ll be uncompressed and huge (≈55 MB in this case). Chrome and Firefox will keep the compressed .gz version; Safari will automatically unzip it and throw away the .gz version, which is annoying.

Try to keep the compressed version. You don’t even need to extract/unzip it—the {ipumsr} data loading functions will handle unzipping for you automatically behind the scenes.

The machine-readable XML codebook, or the DDI file. R uses this to clean and relabel the raw data when you load it. If you click on the DDI link in the Codebook column, your browser will likely open a plain text XML file, which isn’t really what you want. Instead, right click on the DDI link and choose “Save file as…” or “Download file as…” or whatever your browser calls it. This will let you save the XML file to your computer.

Context menu for the DDI codebook link

There are some other helpful links there too:

If you click on the R link, it’ll give you a barebones R script for loading the data. It’ll look something like this:

# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")

ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)

If you click on the basic codebook, you’ll get a short plain text version of the codebook, which I find really helpful for remembering which variables show up where and how each variable is coded.

Move the newly downloaded .dat (or .dat.gz) data file and the DDI .xml file to the same folder somewhere on your computer (preferably in an RStudio Project or a Positron project/folder or wherever your R working directory is). I put mine in a folder named raw_data; you can put it wherever.

We’re finally ready to load this CPS data into R!

More reproducible alternative: using the IPUMS API

Alternatively, it’s possible to use the {ipumsr} package to access the IPUMS API directly and not need to manually download the data extract from the IPUMS website.

The {ipumsr} API functions essentially let you programmatically create and download a data extract cart. Unless you know the IPUMS CPS data really well, you’ll still likely need to hunt around the website for specific variables and their availabilities, so the whole previous section is still relevant.

The {ipumsr} vignette for working with the API is nice and complete—see that for full details.

Here’s an abbreviated example of how to get the same data extract we collected manually from the website:

Go to your IPUMS dashboard and create an API key. This needs to be stored as an environment variable named IPUMS_API_KEY. You can manually add this to your .Renviron file, or you can run this to make {ipumsr} do it for you:
```
ipumsr::set_ipums_api_key("BLAH", save = TRUE)
```

Build a data extract with define_extract_micro(). This is the equivalent of adding stuff to your data cart on the IPUMS website. You’ll need to know two things:

The variable names, which you can find by searching the IPUMS website

The sample IDs, which you can find by running get_sample_info():

library(tidyverse)
library(ipumsr)

all_cps_samples <- get_sample_info(collection = "cps")

# Find the names for 2019 samples
all_cps_samples |> 
  filter(str_detect(description, "2019"))
#> # A tibble: 13 × 2
#>    name        description              
#>    <chr>       <chr>
#>  1 cps2019_01s IPUMS-CPS, January 2019  
#>  2 cps2019_02s IPUMS-CPS, February 2019 
#>  3 cps2019_03b IPUMS-CPS, March 2019    
#>  4 cps2019_04b IPUMS-CPS, April 2019    
#>  5 cps2019_05s IPUMS-CPS, May 2019      
#>  6 cps2019_06s IPUMS-CPS, June 2019     
#>  7 cps2019_03s IPUMS-CPS, ASEC 2019  
#>  ...

The code for creating the extract will look like this:

cps_extract_definition <- define_extract_micro(
  collection = "cps",
  description = "API extract for blog post",
  samples = c(
    "cps2019_03s",  # ASEC, March 2019
    "cps2019_09s",  # CPS, September 2019
    "cps2018_11s"   # CPS, November 2018
  ),
  variables = c(
    "AGE", "SEX", "MARST", "EDUC",
    "VLDONATE", "VLSTATUS", "VLSUPPWT",
    "VOTED", "VOSUPPWT"
  )
)

This is identical to what we had in the cart on the website: 9 variables and 3 samples:

cps_extract_definition
#> Unsubmitted IPUMS CPS extract 
#> Description: API extract for blog post
#> 
#> Samples: (3 total) cps2019_03s, cps2019_09s, cps2018_11s
#> Variables: (9 total) AGE, SEX, MARST, EDUC, VLDONATE, VLSTATUS, VLSU...

Submit the request to the server to generate the extract. This is equivalent to checking out your data cart on the website.
```
cps_extract <- submit_extract(cps_extract_definition)
#> Successfully submitted IPUMS CPS extract number ZZZZ
```

Download the extract. The extract won’t be downloadable immediately—you need to wait for e-mail confirmation. Once it’s ready, you can download it with download_extract(), which will download both the .dat.gz data and the .xml codebook to your computer:

cps_downloaded <- download_extract(cps_extract, download_dir = "raw_data")
#>  |==================================================| 100%
#>  |==================================================| 100%
#> DDI codebook file saved to ~/blah/raw_data/cps_ZZZZ.xml
#> Data file saved to ~/blah/raw_data/cps_ZZZZ.dat.gz

Once it’s on your computer, you can load it with the standard {ipumsr} process shown in the next section.
```
cps_data <- read_ipums_micro(cps_downloaded)
```

The IPUMS API and literate programming

If you’re using a literate programming document like Quarto or R Markdown, don’t include this API extraction process in your document. It will rerun every time you render your document and create a new IPUMS extract each time, which is excessive. It’s best to run this process in a separate R script or function (perhaps orchestrated with something like {targets}), and then load the DDI .xml and .dat data in the document.
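
One way to keep the extract out of the render loop is to guard it behind a file check. This is only a sketch: `fetch_if_missing()` is a hypothetical helper (it is not part of {ipumsr}), and the function passed to it is a stand-in for the `submit_extract()` and `download_extract()` steps shown above.

```r
# Hypothetical helper (not part of {ipumsr}): only run `fetch` when the
# file at `path` doesn't exist yet
fetch_if_missing <- function(path, fetch) {
  if (!file.exists(path)) {
    fetch(path)
  }
  invisible(path)
}

# Stand-in for the real API calls; it just writes a placeholder file and
# counts how many times it runs
n_fetches <- 0
fake_fetch <- function(path) {
  n_fetches <<- n_fetches + 1
  writeLines("placeholder", path)
}

demo_path <- tempfile(fileext = ".xml")
fetch_if_missing(demo_path, fake_fetch)  # File is missing, so this fetches
fetch_if_missing(demo_path, fake_fetch)  # File exists now, so this is skipped
n_fetches
```

The second call does nothing because the file is already there, which is the behavior you’d want when re-rendering a document.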

Loading CPS data

Getting this data into R is easy thanks to the {ipumsr} package. We feed the XML DDI codebook into read_ipums_ddi() and then feed that into read_ipums_micro():

library(ipumsr)

ddi <- read_ipums_ddi("raw_data/cps_00001.xml")
cps_data <- read_ipums_micro(ddi, data_file = "raw_data/cps_00001.dat.gz", verbose = FALSE)

glimpse(cps_data)
## Rows: 421,402
## Columns: 21
## $ YEAR      2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ SERIAL    1, 1, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 13, 13, 14, 14, 14, 15, 15, 16, 17, 19, 20, 20, 21, 21, 22, 23, 23, 23, 23, 24, 25, 26, 26, 26, 26, 26, 28, 28, 28, 28, 29, 30, 30, 30, 32, 36, 36, 37, 37, 37, 37, 39, 39, 39, 39, 39, 40, 40, 41, 41, 41, 41, 42,…
## $ MONTH     11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
## $ HWTFINL   1704, 1704, 1957, 1688, 1688, 1688, 1688, 2090, 2090, 1832, 1832, 1779, 1779, 1779, 1853, 1853, 2077, 2077, 1427, 1427, 1611, 2044, 1738, 1738, 1690, 1690, 1690, 3135, 3135, 2679, 2253, 1639, 1615, 1615, 1515, 1515, 2254, 1459, 1459, 1459, 1459, 1960, 1942, 1701, 1701, 1701, 1701,…
## $ CPSID     2.017e+13, 2.017e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e…
## $ ASECFLAG  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ASECWTH   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ PERNUM    1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 1, 2, 3, 1, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 1, 1, 1, 2, 1, 1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4,…
## $ WTFINL    1704, 1845, 1957, 1688, 2780, 2780, 2679, 2090, 2090, 1832, 2679, 1754, 1779, 2452, 1853, 1870, 2151, 2077, 2003, 1427, 1611, 2044, 1738, 1738, 1690, 2102, 2537, 3135, 4172, 2679, 2253, 1639, 1615, 2609, 1900, 1515, 2254, 1459, 1546, 1912, 1752, 1960, 1942, 1527, 1701, 2231, 2104,…
## $ CPSIDP    2.017e+13, 2.017e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.018e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e+13, 2.017e…
## $ CPSIDV    2.017e+14, 2.017e+14, 2.018e+14, 2.017e+14, 2.017e+14, 2.017e+14, 2.017e+14, 2.018e+14, 2.018e+14, 2.017e+14, 2.017e+14, 2.018e+14, 2.018e+14, 2.018e+14, 2.017e+14, 2.017e+14, 2.018e+14, 2.018e+14, 2.018e+14, 2.018e+14, 2.017e+14, 2.017e+14, 2.017e+14, 2.017e+14, 2.017e+14, 2.017e…
## $ ASECWT    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ AGE       26, 26, 48, 53, 16, 16, 20, 22, 23, 57, 23, 61, 62, 39, 74, 49, 54, 52, 69, 76, 41, 56, 64, 62, 53, 13, 21, 28, 28, 40, 51, 78, 64, 40, 59, 36, 27, 35, 36, 5, 8, 76, 80, 36, 33, 3, 6, 11, 62, 80, 61, 61, 57, 24, 22, 26, 61, 24, 0, 37, 3, 5, 32, 33, 8, 9, 10, 15, 55, 61, 29, 29…
## $ SEX       2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2…
## $ MARST     6, 6, 4, 4, 6, 6, 6, 6, 6, 4, 6, 1, 1, 6, 5, 6, 1, 1, 1, 1, 6, 4, 6, 6, 5, 9, 6, 1, 1, 6, 3, 5, 6, 6, 1, 1, 6, 1, 1, 9, 9, 5, 4, 1, 1, 9, 9, 9, 6, 5, 4, 3, 4, 6, 6, 6, 3, 6, 9, 3, 9, 9, 6, 6, 9, 9, 9, 6, 1, 1, 1, 1, 9, 9, 6, 4, 1, 1, 4, 1, 1, 9, 1, 1, 1, 1, 9, 9, 1, 1, 4, 1, 5…
## $ EDUC      111, 123, 73, 81, 50, 50, 81, 81, 81, 111, 81, 81, 81, 92, 81, 81, 123, 111, 81, 60, 111, 73, 73, 73, 81, 1, 81, 81, 81, 91, 111, 60, 73, 73, 125, 81, 92, 124, 111, 1, 1, 60, 73, 123, 125, 1, 1, 1, 73, 30, 73, 73, 124, 81, 92, 73, 73, 73, 1, 20, 1, 1, 20, 73, 1, 1, 1, 30, 111,…
## $ VLSTATUS  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ VLDONATE  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ VLSUPPWT  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ VOTED     98, 98, 2, 2, 99, 99, 2, 99, 99, 98, 98, 2, 2, 2, 1, 97, 2, 2, 1, 1, 2, 98, 2, 1, 2, 99, 1, 1, 2, 98, 98, 2, 98, 98, 98, 98, 2, 2, 2, 99, 99, 1, 2, 1, 1, 99, 99, 99, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 99, 98, 99, 99, 98, 2, 99, 99, 99, 99, 1, 1, 2, 2, 99, 99, 2, 98, 2, 2, 96, 2, 2,…
## $ VOSUPPWT  1704, 1845, 1957, 1688, 2780, 2780, 2679, 2090, 2090, 1832, 2679, 1754, 1779, 2452, 1853, 1870, 2151, 2077, 2003, 1427, 1611, 2044, 1738, 1738, 1690, 2102, 2537, 3135, 4172, 2679, 2253, 1639, 1615, 2609, 1900, 1515, 2254, 1459, 1546, 1912, 1752, 1960, 1942, 1527, 1701, 2231, 2104,…

Holy moly, we have nearly half a million rows. That’s because we have three samples (2019 ASEC, September 2019, and November 2018) and they’re all stacked on top of each other in this data. We need to filter this huge data to extract the three samples. We’ll also recode missing and “not in universe” values as NA.
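
A quick way to confirm that the samples really are stacked is to count rows by `YEAR`, `MONTH`, and `ASECFLAG`. Here’s a base R sketch with a tiny made-up data frame (`toy_cps` is invented for illustration); with the real data, something like `cps_data |> count(YEAR, MONTH, ASECFLAG)` should show the same pattern.

```r
# Made-up stand-in for the stacked CPS extract
toy_cps <- data.frame(
  YEAR = c(2018, 2018, 2019, 2019, 2019),
  MONTH = c(11, 11, 3, 9, 9),
  ASECFLAG = c(NA, NA, 1, NA, NA)
)

# Cross-tabulate the three columns, keeping NA as its own category
sample_counts <- as.data.frame(
  table(
    YEAR = toy_cps$YEAR,
    MONTH = toy_cps$MONTH,
    ASECFLAG = toy_cps$ASECFLAG,
    useNA = "ifany"
  ),
  responseName = "n"
)

# Drop the combinations that never occur
sample_counts <- sample_counts[sample_counts$n > 0, ]
sample_counts
```

Each remaining row corresponds to one sample, which is what the filters below rely on.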

cps_demographics <- cps_data |>
  # Only look at the 2019 ASEC data
  filter(YEAR == 2019, MONTH == 03, ASECFLAG == 1) |>
  # Recode values that are missing or "not in universe" as NA
  mutate(
    SEX = ifelse(SEX == 9, NA, SEX),
    MARST = ifelse(MARST == 9, NA, MARST),
    EDUC = ifelse(EDUC < 1 | EDUC == 999, NA, EDUC)
  )

cps_volunteer <- cps_data |> 
  filter(YEAR == 2019, MONTH == 09) |> 
  # Recode values that are missing or "not in universe" as NA
  mutate(across(c(VLSTATUS, VLDONATE), \(x) ifelse(x == 99, NA, x)))

cps_voting <- cps_data |> 
  filter(YEAR == 2018, MONTH == 11) |>
  # Recode values that are missing or "not in universe" as NA
  mutate(VOTED = ifelse(VOTED == 99, NA, VOTED))

These counts are more reasonable (but still huge!):

nrow(cps_demographics)
## [1] 180101
nrow(cps_volunteer)
## [1] 118557
nrow(cps_voting)
## [1] 122744
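
To see why the NA recoding matters, here’s a toy sketch (made-up values, base R only) of what happens to a proportion when a “not in universe” code like 99 is left in place versus recoded to `NA`:

```r
# Made-up VOTED-style codes for illustration: 2 = voted, 1 = didn't vote,
# 99 = not in universe
voted_raw <- c(2, 2, 1, 99, 99, 2, 1, 99)

# Leaving 99 in place treats those rows as non-voters
prop_raw <- mean(voted_raw == 2)
prop_raw
#> [1] 0.375

# Recoding 99 to NA and dropping NAs limits the denominator to eligible rows
voted_clean <- ifelse(voted_raw == 99, NA, voted_raw)
prop_clean <- mean(voted_clean == 2, na.rm = TRUE)
prop_clean
#> [1] 0.6
```

Note that any other codes left in the column (the glimpse() output above shows 96, 97, and 98 in VOTED) still count toward the denominator as `FALSE` in a comparison like `VOTED == 2`.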

Summarizing CPS data

Ultimately, our goal is to find the population-level average of a bunch of characteristics and see if our sample plausibly matches population averages.

Things get a little tricky and loosey-goosey here. The different levels measured by the CPS don’t always match what’s in the survey. For example, the CPS measures sex and provides only 2 levels (1 = male; 2 = female); the experiment called this construct gender and included male, female,¹ transgender, prefer not to say, and other.

¹ We should have called these “man” and “woman” since gender ≠ sex.

To make the survey question reasonably match what the CPS is capturing, I find that it’s easiest to collapse both the survey data and the CPS data to simpler constructs. Before we collapse things, though, we need to look at one statistical issue: weighting.

Weighting

In an effort to make the CPS nationally representative, every row is weighted—each individual does not represent the same number of persons in the population. The Census oversamples some subpopulations and shifts weights up and down to give individuals more or less statistical influence in the sample so that the survey results better approximate the characteristics of the general population. Any analysis we do with CPS data needs to take those weights into account.

The weights for ASEC variables are included in the ASECWT column; the weights for volunteering and voting variables are in the VLSUPPWT and VOSUPPWT columns.

If we’re calculating basic averages, we can use weighted.mean() instead of mean(). Note the difference in average when we don’t weight!

cps_demographics |> 
  summarize(
    avg_age_weighted = weighted.mean(AGE, w = ASECWT),  # GOOD
    avg_age_unweighted = mean(AGE)  # BAD
  )
## # A tibble: 1 × 2
##   avg_age_weighted avg_age_unweighted
##              <dbl>              <dbl>
## 1             38.8               37.3
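
To see what `weighted.mean()` is doing, here’s a toy sketch with three made-up people. Each weight is roughly “how many people in the population this row stands in for,” so the weighted mean is just `sum(x * w) / sum(w)`:

```r
# Made-up ages and weights; the third person counts twice as much
ages <- c(20, 40, 60)
wts <- c(1, 1, 2)

unweighted <- mean(ages)
unweighted
#> [1] 40

weighted <- weighted.mean(ages, w = wts)
weighted
#> [1] 45

# Same thing by hand
by_hand <- sum(ages * wts) / sum(wts)
by_hand
#> [1] 45
```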

If we’re doing stuff with models, we can use the weights argument:

# BAD: Non-weighted intercept-only model
lm(AGE ~ 1, data = cps_demographics) |> 
  tidy(conf.int = TRUE)
## # A tibble: 1 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     37.3    0.0540      690.       0     37.2      37.4

# GOOD: Weighted intercept-only model
lm(AGE ~ 1, data = cps_demographics, weights = ASECWT) |> 
  tidy(conf.int = TRUE)
## # A tibble: 1 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     38.8    0.0542      715.       0     38.7      38.9
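
The weighted model’s intercept matches the weighted average from before (38.8) because a weighted intercept-only regression is just another way of calculating a weighted mean. A quick sketch with made-up numbers:

```r
# Made-up data: three people, the third with double the weight
toy <- data.frame(age = c(20, 40, 60), wt = c(1, 1, 2))

# Intercept from a weighted intercept-only model
model_intercept <- unname(coef(lm(age ~ 1, data = toy, weights = wt)))

# Plain weighted mean
toy_weighted_mean <- weighted.mean(toy$age, w = toy$wt)

c(model = model_intercept, weighted_mean = toy_weighted_mean)
```

Both values are 45 here.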

Base R only really has weighted.mean(). If we want other things, like a weighted variance, or weighted rank, or weighted table/crosstabs, we can use a bunch of different functions in the {Hmisc} package:

# Some Hmisc::wtd.*() things:
cps_demographics |> 
  summarize(
    avg_age = Hmisc::wtd.mean(AGE, weights = ASECWT),
    var_age = Hmisc::wtd.var(AGE, weights = ASECWT),
    sd_age = sqrt(var_age)
  )
## # A tibble: 1 × 3
##   avg_age var_age sd_age
##     <dbl>   <dbl>  <dbl>
## 1    38.8    529.   23.0
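
For intuition, here’s a base R sketch of a weighted variance with made-up numbers. I’m assuming weights that act like frequency counts and a `sum(w) - 1` denominator, which I believe matches the `Hmisc::wtd.var()` defaults (check its documentation if the details matter for your analysis):

```r
# Made-up ages and whole-number weights
ages <- c(20, 40, 60)
wts <- c(1, 1, 2)

# Weighted mean, then weighted squared deviations around it
w_mean <- sum(ages * wts) / sum(wts)
w_var <- sum(wts * (ages - w_mean)^2) / (sum(wts) - 1)

# With whole-number weights, this is the same as repeating each row
# `wts` times and using the regular var()
expanded_var <- var(rep(ages, times = wts))

c(weighted = w_var, expanded = expanded_var)
```

Real CPS weights aren’t whole numbers, but the idea is the same: rows with bigger weights pull the mean and the spread toward themselves.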

Calculating population-level proportions

We’ll collapse these population-level CPS values into binary versions of each question so that we can look at things like the proportion of women, the proportion of people who volunteer, and so on. We’ll also collapse age into a binary above/below the median age—this isn’t necessary, and we could totally work with numeric age instead of proportions, but in our anonymized survey data, our age column is an indicator representing being above/below 36 (the median age at the time of the survey).

We’ll do some basic summarizing with weighted.mean() and calculate all these national proportions, along with the weighted standard deviations (which will be important for the Bayesian analysis later in this post).

national_demographics <- cps_demographics |> 
  summarize(
    # AGE is already numeric
    age = weighted.mean(AGE >= 36, ASECWT), 
    age_sd = sqrt(Hmisc::wtd.var(AGE >= 36, weights = ASECWT)),

    # 2 = Female
    female = weighted.mean(SEX == 2, ASECWT),
    female_sd = sqrt(Hmisc::wtd.var(SEX == 2, weights = ASECWT)),

    # 1 = Married, spouse present
    # 2 = Married, spouse absent
    married = weighted.mean(MARST %in% 1:2, ASECWT, na.rm = TRUE),
    married_sd = sqrt(Hmisc::wtd.var(MARST %in% 1:2, weights = ASECWT)),

    # 111 = Bachelor's degree
    college = weighted.mean(EDUC >= 111, ASECWT, na.rm = TRUE),
    college_sd = sqrt(Hmisc::wtd.var(EDUC >= 111, weights = ASECWT))
)

national_volunteer <- cps_volunteer |> 
  summarize(
    # 1 = Volunteer
    volunteering = weighted.mean(VLSTATUS == 1, VLSUPPWT, na.rm = TRUE),
    volunteering_sd = sqrt(Hmisc::wtd.var(VLSTATUS == 1, weights = VLSUPPWT)),

    # 2 = Yes, made a donation to charity in the past 12 months
    donating = weighted.mean(VLDONATE == 2, VLSUPPWT, na.rm = TRUE),
    donating_sd = sqrt(Hmisc::wtd.var(VLDONATE == 2, weights = VLSUPPWT))
  )

national_voting <- cps_voting |>
  summarize(
    # 2 = Voted in the most recent November election
    voting = weighted.mean(VOTED == 2, VOSUPPWT, na.rm = TRUE),
    voting_sd = sqrt(Hmisc::wtd.var(VOTED == 2, weights = VOSUPPWT))
  )

I like to store these in a little one-row data frame so that it’s easy to access individual values:

national_values <- bind_cols(
  national_demographics, national_volunteer, national_voting
)
national_values
## # A tibble: 1 × 14
##     age age_sd female female_sd married married_sd college college_sd volunteering volunteering_sd donating donating_sd voting voting_sd
##   <dbl>  <dbl>  <dbl>     <dbl>   <dbl>      <dbl>   <dbl>      <dbl>        <dbl>           <dbl>    <dbl>       <dbl>  <dbl>     <dbl>
## 1 0.530  0.499  0.510     0.500   0.410      0.492   0.258      0.437        0.300           0.458    0.474       0.499  0.534     0.499

# Proportion of women
national_values$female
## [1] 0.5097

# Proportion that voted
national_values$voting
## [1] 0.5344
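
As a sanity check, the standard deviation of a 0/1 indicator with proportion p is approximately sqrt(p × (1 - p)) when the sample is large, so the `_sd` columns should be recoverable from the proportions themselves. Here’s a sketch using a few of the rounded values printed above (typed in by hand, so expect small rounding differences for other columns):

```r
# Rounded proportions copied from the national_values output
props <- c(age = 0.530, female = 0.510, married = 0.410, voting = 0.534)

# Approximate SD of each binary indicator
sd_check <- round(sqrt(props * (1 - props)), 3)
sd_check
```

This gives 0.499, 0.500, 0.492, and 0.499, which lines up with age_sd, female_sd, married_sd, and voting_sd.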

Summarizing sample proportions

We’re almost done! All that’s left is testing whether the demographic characteristics of the survey experiment respondents reasonably match their corresponding population proportions.

First, though, we need to make binary versions of the survey responses. To make life easier, we’ll use the same names as the CPS data:

results_to_test <- results |> 
  mutate(
    age = age == "More than median",
    female = gender == "Female",
    married = marital_status == "Married",
    college = education %in% c(
      "4 year degree", 
      "Graduate or professional degree", 
      "Doctorate"
    ),
    volunteering = volunteer_frequency != "Haven't volunteered in past 12 months",
    donating = donate_frequency == "More than once a month, less than once a year",
    voting = voted == "Yes"
  ) |> 
  select(female, age, married, college, volunteering, donating, voting)

glimpse(results_to_test)
## Rows: 1,300
## Columns: 7
## $ female        FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, …
## $ age           TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, F…
## $ married       TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRU…
## $ college       TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, …
## $ volunteering  TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TR…
## $ donating      FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ voting        TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE…

Testing sample vs. population proportions frequentist-ly

One-sample proportion test for age

As a quick and easy check, we can run a one-sample proportion test to see if the proportion of a variable is significantly different from a null value. We can do this with prop.test(), which works a bunch of different ways—with matrices, with vectors, and with single values (see this blog post for some other examples of prop.test()).

Let’s look at age first. 50.77% of people in the sample are older than 36; 53% of people in the population are older than 36:

# Proportion of sample older than 36
mean(results_to_test$age)
## [1] 0.5077

# CPS proportion older than 36
national_values$age
## [1] 0.5303

Is that an issue? Is the sample significantly younger than the rest of the country?

For this one-sample test, we need to feed prop.test() three things: (1) the number of “successes”, or the count of rows where the respondent is older than 36 (or where age is TRUE), (2) the number of rows in the sample, and (3) the null value, or the population-level CPS proportion:

prop_test_freq_age <- prop.test(
  x = sum(results_to_test$age),  # Number of "successes" (rows where age == TRUE)
  n = nrow(results_to_test),     # Sample size
  p = national_values$age        # Population-level proportion from the CPS
)

tidy(prop_test_freq_age)
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method                                               alternative
##                                                                                   
## 1    0.508      2.59   0.108         1    0.480     0.535 1-sample proportions test with continuity correction two.sided
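To see what `prop.test()` is doing under the hood, here is a small sketch that computes the same chi-squared statistic and p-value by hand, including the continuity correction. The counts (`x`, `n`) and null proportion (`p0`) are hypothetical round numbers picked for illustration, not the exact survey or CPS values:

```r
# Hypothetical counts, chosen only for illustration
x  <- 660    # Number of "successes"
n  <- 1300   # Sample size
p0 <- 0.53   # Null (population-level) proportion

# Chi-squared statistic with Yates' continuity correction
correction <- min(0.5, abs(x - n * p0))
manual_stat <- (abs(x - n * p0) - correction)^2 / (n * p0 * (1 - p0))

# Two-sided p-value from a chi-squared distribution with 1 degree of freedom
manual_p <- pchisq(manual_stat, df = 1, lower.tail = FALSE)

# Compare with prop.test()
pt <- prop.test(x = x, n = n, p = p0)
c(manual = manual_stat, prop_test = unname(pt$statistic))
c(manual = manual_p, prop_test = pt$p.value)
```

The manual and `prop.test()` values should match, which shows that the test only needs the success count, the sample size, and the null proportion.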

For fun, we can plot this too:

Code for making this plot ↓

Apple Music Wrapped with R

Andrew Heiss — Wed, 04 Dec 2024 05:00:00 GMT

’Tis the season for Spotify Wrapped stats and I love it, both for seeing what everyone listens to and because it’s such a cool way of presenting data. A few years ago on Twitter, Caitlin Hudon noted that

Spotify Wrapped is a great example of how you can build a fantastic data product without machine learning or AI. (@beeonaposy)

At its core, Spotify Wrapped is really just some grouped and summarized data—a PivotTable with some album cover art slapped on. And it’s fun and neat and everyone loves it!

I’ve always been jealous of everyone’s annual Spotify Wrapped reports, but since I don’t use Spotify, I’ve never gotten to see my own details.

Because I’m an Elder Millennial and started listening to music in the days of Napster, I prefer to control my music files rather than stream them Spotify-style, so I get all my stuff from either the Amazon Music store or Bandcamp since they both provide DRM-free MP3s. I listen to everything in the used-to-be-iTunes Music app (not to be confused with Apple’s music streaming service, Apple Music), and I use iTunes Match to access my library across all my devices.¹

¹ I also have it all backed up to a Plex server on a Synology NAS in my house, and my kids listen to music on it through the Plexamp app, but I don’t because I still prefer using the iTunes/Apple Music desktop app 🤷‍♂️.

iTunes/Music keeps track of some song metadata, like a count of the number of times a song has been played:

All that metadata is stored in a big ol’ gross XML file. In the days of iTunes, you could find it at ~/Music/iTunes/iTunes Library.xml; with Apple Music, it’s hidden in ~/Music/Music/Music Library/Library.musicdb. The easiest way to access it is to export a copy of it from Music with File > Library > Export Library…. It has a bunch of neat details about each file in your library:

<key>34813</key>
<dict>
    <key>Track ID</key><integer>34813</integer>
    <key>Name</key><string>In Another Life</string>
    <key>Artist</key><string>The Killers</string>
    <key>Album Artist</key><string>The Killers</string>
    <key>Album</key><string>Pressure Machine</string>
    <key>Genre</key><string>Alternative Rock</string>
    <key>Kind</key><string>MPEG audio file</string>
    <key>Size</key><integer>7632215</integer>
    <key>Total Time</key><integer>225724</integer>
    <key>Disc Number</key><integer>1</integer>
    <key>Disc Count</key><integer>1</integer>
    <key>Track Number</key><integer>8</integer>
    <key>Track Count</key><integer>11</integer>
    <key>Year</key><integer>2021</integer>
    <key>Date Modified</key><date>2021-08-13T17:38:22Z</date>
    <key>Date Added</key><date>2021-08-13T13:38:36Z</date>
    <key>Bit Rate</key><integer>268</integer>
    <key>Sample Rate</key><integer>44100</integer>
    <key>Comments</key><string>Amazon.com Song ID: REDACTED</string>
    <key>Play Count</key><integer>97</integer>
    <key>Play Date</key><integer>3815892895</integer>
    <key>Play Date UTC</key><date>2024-12-01T15:14:55Z</date>
    <key>Rating</key><integer>100</integer>
    <key>Album Rating</key><integer>100</integer>
    <key>Album Rating Computed</key><true/>
    <key>Normalization</key><integer>6230</integer>
    <key>Artwork Count</key><integer>1</integer>
    <key>Sort Album Artist</key><string>Killers</string>
    <key>Sort Artist</key><string>Killers</string>
    <key>Persistent ID</key><string>211319FB11435185</string>
    <key>Track Type</key><string>File</string>
    <key>Location</key><string>file:///Users/andrew/Music/iTunes/iTunes%20Music/Music/The%20Killers/Pressure%20Machine/08%20In%20Another%20Life.mp3</string>
    <key>File Folder Count</key><integer>5</integer>
    <key>Library Folder Count</key><integer>1</integer>
</dict>

It keeps track of play count…

    <key>Play Count</key><integer>97</integer>
    <key>Play Date</key><integer>3815892895</integer>
    <key>Play Date UTC</key><date>2024-12-01T15:14:55Z</date>

…but unfortunately for Spotify Wrapped purposes, it overwrites the count and date information when you listen to a track—it doesn’t keep track of individual play counts. Here’s what the XML for “In Another Life” looked like before I listened to the track while writing this post:

<key>Play Count</key><integer>96</integer>
<key>Play Date</key><integer>3808562446</integer>
<key>Play Date UTC</key><date>2024-09-07T18:00:46Z</date>

That September 7th listen was erased from history once I hit play in December :(

That means it’s impossible to figure out how many times you listened to a track during a given time period, since the play count is a running total and the play date only records the most recent listen. With one XML export, you can’t find Spotify Wrapped-like details about listening habits in a single year.

However, if you have two XML exports, you can!

Calculating 2024 play counts with R

I played the long game this year and exported a copy of my iTunes/Music library on the morning of January 1 and stored the XML file in a folder on my computer. I then exported a copy of the library as it stands today. With these two library files, I can subtract the play count from January 1 from the play count today and find how many times I listened to each track. It still doesn’t give me date information—there’s no way to see time trends like what I was listening to in March or whatever²—but it gives me good data to work with.

² If I were super on top of things and cared that much, I could set up a script to automatically export a copy of the library every day and then reverse engineer daily listening data, but that seems like an excessive amount of work.
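The subtraction idea can be sketched with two tiny data frames. Everything here is a hypothetical stand-in: `library_jan` and `library_now` represent the two parsed exports, and the `persistent_id`, `name`, and `play_count` column names are assumptions about what the parsed data might look like, not the actual output of any parsing function:

```r
library(dplyr)

# Hypothetical stand-ins for the parsed January 1 export and today's export
library_jan <- tibble(
  persistent_id = c("A1", "B2", "C3"),
  name = c("Song A", "Song B", "Song C"),
  play_count = c(96, 10, 0)
)

library_now <- tibble(
  persistent_id = c("A1", "B2", "C3", "D4"),
  name = c("Song A", "Song B", "Song C", "Song D"),
  play_count = c(97, 25, 0, 8)
)

plays_this_year <- library_now |>
  # Bring in the starting play count for each track
  left_join(
    library_jan |> select(persistent_id, play_count_start = play_count),
    by = "persistent_id"
  ) |>
  mutate(
    # Tracks added after January 1 have no starting count, so treat it as 0
    play_count_start = coalesce(play_count_start, 0),
    plays = play_count - play_count_start
  ) |>
  # Keep only the tracks that were actually played this year
  filter(plays > 0) |>
  arrange(desc(plays))

plays_this_year
```

Joining on a stable identifier rather than the track name avoids problems with renamed tracks or duplicate titles, and the `coalesce()` step keeps tracks that were added partway through the year.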

In the spirit of Caitlin’s tweet, I’m going to keep the analysis of this data as simple and straightforward as possible—just filtering, grouping, and summarizing.

The only bit of fancy R work comes at the beginning with parsing and cleaning the Apple Music XML files. The track information is deeply nested inside a bunch of XML layers and untangling all that requires some data wrangling. Fortunately, Simon Couch already did it in his 2022 analysis of his music, and he even made an accompanying package, {wrapped}, for doing it yourself. His package is designed to extract the play counts of all the music added in a given year, while I want the counts for all years, so I modified his wrap_library() function slightly to ignore the year argument and just parse everything. The modified function, now read_itunes_library(), is below, for the morbidly curious:

R code for read_itunes_library()

Guide to generating and rendering computational markdown content programmatically with Quarto

Andrew Heiss — Mon, 04 Nov 2024 05:00:00 GMT

This year, I’ve helped build the Idaho Secretary of State’s office’s election results website for both the primary and general elections. Working with election data is a complex process, with each precinct reporting results to their parent counties, which all use different systems and software and candidate identifiers. Those county results then go into a central state-level database that state officials have access to for analysis and reporting.

In 2024, Idaho used a Quarto website to present the results for each statewide, congressional, and legislative contest (the URL is results.voteidaho.gov, though the Quarto website part probably won’t live there forever):

I may write something up someday about the process of building the website, depending on NDAs and security and copyright arrangements and whatnot. There are some neat technical details involved in the whole process, like complex static {targets} branching, remote {targets} storage, and replicating the structure of the real, live results database with a local DuckDB database for testing things without connecting to the live database. The short, sanitized version is that it uses two {targets} pipelines:

An ETL pipeline connects to the central results database, retrieves the latest results for each contest, standardizes the identifiers and cleans up the results, generates summary tables and maps for each race, and pushes the cleaned results to a shared datastore.
A website pipeline grabs the latest cleaned results from the shared datastore and builds a Quarto website.

The two pipelines run independently of each other every few minutes, and thanks to the magic of {targets}, if there are no updates (i.e. if there’s a lull in the reporting on election night), nothing needs to rebuild. It’s really neat.

There’s one Quarto/R Markdown trick that I used extensively when building the site: it’s possible to use R to automatically generate Quarto markdown before the entire document runs, allowing you to create parameterized templates for repeated elements.

Each race uses a tabset to show panels for (1) a table and (2) an interactive map of the results, and the reporting status for all the counties involved in the race is included in a callout block.

Results table
Results map

The markdown for each of these race results sections looks something like this:

## Parent district

::: {.callout-note title="X/Y counties fully reported" collapse="true"}
**Complete**: list of counties

**In progress**: list of counties
:::

### Race name

::: {.panel-tabset}
#### Table

```{r}
# R code for creating the table
```

#### Map

```{r}
# R code for creating the map
```
:::

For pages where there’s only one race, like the presidential election and the state’s constitutional amendment election, it’s trivial enough to just copy/paste that general template and replace the corresponding R code. But for the state-level legislative page, there are dozens of races. Repeating and modifying all that markdown 100+ times would be miserable. So instead, we programmatically generate the markdown for each race before the site is rendered so that Quarto thinks it’s working with hand-typed markdown.

Generating big chunks of markdown like this is a really cool approach with all sorts of applications (generate sections of a website; generate panel tabsets; generate presentation slides; etc.), but it’s a little unwieldy at first. So in this post, I’ll (1) show why this is trickier than just using regular R chunks with results="asis", (2) present a detailed step-by-step explanation of how to pre-render generated computational chunks, (3) provide a shorter, simpler, less-annotated example, and (4) give a more complex, less-annotated example.

Why not just use `results="asis"`?

It’s easy to use R/Python chunks to generate HTML or LaTeX or Typst or markdown and have that output appear in the rendered document—this is essentially what table-making packages like {tinytable}, {gt}, and {kableExtra} all do. To illustrate this, let’s load {gapminder} data:

library(tidyverse)
library(glue)
library(gapminder)

gapminder_2007 <- gapminder |> filter(year == 2007)

We can create a markdown list of all the continents in the dataset. Here I do it with paste0(), but I could also use a nicer wrapper like {pander}, which includes all sorts of functions for generating markdown.

This correctly makes a list, but it doesn’t get rendered like a list—it’s displayed as code chunk output:

continents <- gapminder_2007 |> 
  distinct(continent) |> 
  pull(continent)

cat(paste0("- ", continents, collapse = "\n"))

- Asia
- Europe
- Africa
- Americas
- Oceania

We can tell Quarto to treat the output of that chunk as raw markdown instead by setting the results="asis" chunk option:

```{r}
#| results: asis
cat(paste0("- ", continents, collapse = "\n"))
```

Asia
Europe
Africa
Americas
Oceania

That’s great and normal and I use this approach all the time for generating non-computational markdown.

Where this doesn’t work is when you have R chunks that need to be computed. To show this, let’s make a list showing π rounded to different digits using inline code chunks. We can manually type it like this:

- `r round(pi, 1)`
- `r round(pi, 2)`
- `r round(pi, 3)`
- `r round(pi, 4)`
- `r round(pi, 5)`

…which renders like this:

3.1
3.14
3.142
3.1416
3.1416

But that’s a lot of typing. So let’s generate it automatically. I’ll do this in a data frame, just because I like working that way, but you could also use standalone vectors (or even—gasp—a loop!):

pi_stuff <- tibble(digits = 1:5) |> 
  mutate(list_element = paste0("- `r round(pi, ", digits, ")`"))

# Everything is in the list_element column:
pi_stuff
## # A tibble: 5 × 2
##   digits list_element      
##    <int> <chr>
## 1      1 - `r round(pi, 1)`
## 2      2 - `r round(pi, 2)`
## 3      3 - `r round(pi, 3)`
## 4      4 - `r round(pi, 4)`
## 5      5 - `r round(pi, 5)`

We can put that column in a results="asis" chunk…

```{r}
#| results: asis
cat(paste0(pi_stuff$list_element, collapse = "\n"))
```

r round(pi, 1)
r round(pi, 2)
r round(pi, 3)
r round(pi, 4)
r round(pi, 5)

…and it renders correctly as markdown, but it doesn’t run the inline chunks :(

This is because of an issue with ordering: Quarto renders the chunk with cat(paste0(...)) and then moves on to the next chunk in the document. It won’t render the R chunks that the pi_stuff$list_element object contains because they’re all nested inside the parent chunk, and Quarto’s rendering process has moved on by the time the newly generated R chunks appear.

The trick is to pre-render the chunks before they officially show up in the document.¹ We can feed the collapsed pi_stuff$list_element object to knitr::knit() in an inline chunk, which makes Quarto render all the R chunks inside the chunk first, then place the output in the document to be rendered in the correct order like normal chunks:

¹ I haven’t found this formally documented anywhere—I stumbled across this approach in this gist from 2015.

Here's some regular markdown text. Let's show a list of 
differently-rounded values of $\pi$ for fun:

`r knitr::knit(text = paste0(pi_stuff$list_element, collapse = "\n"))`

Isn't that neat?

That markdown will render to this:

Here’s some regular markdown text. Let’s show a list of differently-rounded values of π for fun:

3.1
3.14
3.142
3.1416
3.1416

Isn’t that neat?
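To get a feel for what that inline chunk does, you can call knitr::knit() directly in the console. This is a minimal sketch that runs outside of any Quarto document; the `md_template` and `rendered` names are just for illustration:

```r
library(knitr)

# Markdown text containing inline R chunks that still need to be evaluated
md_template <- paste0("- `r round(pi, ", 1:3, ")`", collapse = "\n")
cat(md_template)

# knit() evaluates the inline chunks and returns the finished markdown as a string
rendered <- knit(text = md_template, quiet = TRUE)
cat(rendered)
```

The first cat() shows the raw template with its unevaluated inline chunks; the second shows the same text after knitr has replaced each chunk with its computed value, which is the text Quarto ends up seeing.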

Building a panel tabset with an inline chunk

Technically there was no need to use knitr::knit() in an inline chunk for that previous example. It would be easier to generate the text output within the pi_stuff data frame instead of in a bunch of inline chunks, and then show the results like normal with results="asis":

```{r}
#| results: asis
pi_stuff_easier <- tibble(digits = 1:5) |> 
  mutate(list_element = paste0("- ", round(pi, digits)))

cat(paste0(pi_stuff_easier$list_element, collapse = "\n"))
```

3.1
3.14
3.142
3.1416
3.14159

However, using knitr::knit(text = BLAH) in an inline chunk like this is a powerful trick that lets you do all sorts of more complex document generation automation. Let’s make a more complicated example with real data instead of a bunch of π rounding.

For this example, let’s make a panel tabset with a plot for each continent in the gapminder_2007 dataset we made earlier.

First, we’ll make a list of plots. This can be done any number of ways—I like using group_by() |> nest() and {purrr} functions like map(), but any way will work as long as you have a list of ggplot objects in the end.

continents_plots <- gapminder_2007 |> 
  group_by(continent) |> 
  nest() |> 
  ungroup() |> 
  # We could use map2(), but I like using pmap() just in case I need to
  # expand it beyond 2 things
  mutate(plot = pmap(
    lst(data, continent), 
    \(data, continent) {
      plot_title <- paste0("Health and wealth in ", continent)
      ggplot(data, aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        scale_x_log10(labels = scales::label_dollar(accuracy = 1)) +
        labs(title = plot_title)
    }))

continents_plots
## # A tibble: 5 × 3
##   continent data              plot  
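##   <fct>     <list>            <list>
## 1 Asia      <tibble [33 × 5]> <gg>  
## 2 Europe    <tibble [30 × 5]> <gg>  
## 3 Africa    <tibble [52 × 5]> <gg>  
## 4 Americas  <tibble [25 × 5]> <gg>  
## 5 Oceania   <tibble [2 × 5]>  <gg>  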

The plots are all in the plot column in continents_plots, which is a list of ggplot objects. Here’s one of them:

continents_plots$plot[[3]]

To make a tabset with a panel for each continent, we need to write markdown like this:

::: {.panel-tabset}
### Continent 1

```{r}
#| label: panel-continent-a
#| echo: false
continents_plots$plot[[1]]
```

### Continent 2

```{r}
#| label: panel-continent-b
#| echo: false
continents_plots$plot[[2]]
```

### (…and so on…)

:::

We could just copy/paste those continent sections over and over, but that’s tedious and not very dynamic. Instead, we can create a little markdown template for each panel and generate all these chunks. To do that, we’ll use {glue}, which is a lot nicer for building strings than using paste0(), since it uses Python-style string interpolation. glue::glue() replaces any text inside {}s with the corresponding variable value:

# Some values
animals <- "cats"
number <- 12
location <- "house"

# Ugly paste() way
paste0("There are ", number, " blue ", animals, " in the ", location)
## [1] "There are 12 blue cats in the house"

# Nice glue() way
glue("There are {number} blue {animals} in the {location}")
## There are 12 blue cats in the house

If you need to use literal curly braces in the text, you can either double them or change the delimiters:

pkg_name <- "ggplot2"

# The curly braces disappear
glue("The {pkg_name} package is delightful")
## The ggplot2 package is delightful

# Double them to keep them
glue("The {{{pkg_name}}} package is delightful")
## The {ggplot2} package is delightful

# Or change the delimiter
glue("The {<<pkg_name>>} package is delightful", .open = "<<", .close = ">>")
## The {ggplot2} package is delightful

Being able to change the delimiter is useful since we’ll need to generate chunks that start with ```{r}.

We can use glue() in a function that takes a continent name and a row number and generates a markdown tabset panel:

build_panel <- function(panel_title, plot_index) {
  chunk_label <- glue("panel-continent-{title}", title = janitor::make_clean_names(panel_title))

  output <- glue("
  ### <<panel_title>>

  ```{r}
  #| label: <<chunk_label>>
  #| echo: false
  continents_plots$plot[[<<plot_index>>]]
  ```", .open = "<<", .close = ">>")

  output
}

First let’s make sure it works by itself:

build_panel("Africa", 3)

### Africa

```{r}
#| label: panel-continent-africa
#| echo: false
continents_plots$plot[[3]]
```

Yep!

Now we can iterate through the data frame of all the continents (continents_plots) and make a column that contains the markdown panel text.

continents_plots_with_text <- continents_plots |> 
  mutate(row = row_number()) |> 
  mutate(markdown = pmap_chr(
    lst(continent, row), 
    \(continent, row) build_panel(panel_title = continent, plot_index = row)
  ))

continents_plots_with_text

# A tibble: 5 × 5
  continent data              plot     row markdown
  <fct>     <list>            <list> <int> <chr>
1 Asia      <tibble [33 × 5]> <gg>       1 "### Asia\n\n```{r}\n#| label: panel-continent-asia\n#| echo: false\ncontinents_plots$plot[[1]]\n```"
2 Europe    <tibble [30 × 5]> <gg>       2 "### Europe\n\n```{r}\n#| label: panel-continent-europe\n#| echo: false\ncontinents_plots$plot[[2]]\n```"
3 Africa    <tibble [52 × 5]> <gg>       3 "### Africa\n\n```{r}\n#| label: panel-continent-africa\n#| echo: false\ncontinents_plots$plot[[3]]\n```"
4 Americas  <tibble [25 × 5]> <gg>       4 "### Americas\n\n```{r}\n#| label: panel-continent-americas\n#| echo: false\ncontinents_plots$plot[[4]]\n`…
5 Oceania   <tibble [2 × 5]>  <gg>       5 "### Oceania\n\n```{r}\n#| label: panel-continent-oceania\n#| echo: false\ncontinents_plots$plot[[5]]\n```"

Check out that new markdown column—it has a third level heading with the continent name, followed by an R chunk that will display the corresponding plot object from continents_plots. Here’s what one panel looks like:

cat(continents_plots_with_text$markdown[[1]])

### Asia

```{r}
#| label: panel-continent-asia
#| echo: false
continents_plots$plot[[1]]
```

Finally, we need to concatenate that column into one big string and include it as an inline chunk inside Quarto’s syntax for tabsets:

Health and wealth are related in each continent.

::: {.panel-tabset}

`r knitr::knit(text = paste0(continents_plots_with_text$markdown, collapse = "\n\n"))`

:::

Automatic tabset panels!

Here’s what it looks like when rendered:

Health and wealth are related in each continent.

Automatic tabset panels!

Perfect! We successfully generated a bunch of R chunks, pre-rendered them with Quarto, and then rendered the rest of the document.

Condensed example showing the evolution of a ggplot plot

Now that we’ve walked through the general process in detail, we’ll look at a less didactic example. Suppose you want to show the step-by-step process of creating a ggplot plot in a tabset panel or in a Revealjs Quarto slideshow. You could manually copy/paste a bunch of markdown over and over (ew), or you could generate chunks and make Quarto make the panels or slides for you (yay).

First we’ll make a list of plots. There’s actually a neat new package—{ggreveal}—that can create a list of intermediate plots automatically, but I’ll just do it manually here (though this process will work the same with {ggreveal}).

p1 <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()
p1_text <- glue("
  ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point()")

p2 <- p1 + 
  scale_x_log10(labels = scales::dollar_format(accuracy = 1))
p2_text <- glue("
  {p1_text} +
    scale_x_log10(labels = scales::dollar_format(accuracy = 1))
")

p3 <- p2 + 
  scale_color_viridis_d(option = "plasma", end = 0.9)
p3_text <- glue('
  {p2_text} +
    scale_color_viridis_d(option = "plasma", end = 0.9)
')

p4 <- p3 + 
  theme_minimal()
p4_text <- glue('
  {p3_text} +
    theme_minimal()
')

p5 <- p4 + 
  labs(x = "GDP per capita", y = "Life expectancy", color = "Continent")
p5_text <- glue('
  {p4_text} +
    labs(x = "GDP per capita", y = "Life expectancy", color = "Continent")
')

plot_list <- list(p1, p2, p3, p4, p5)
plot_text <- list(p1_text, p2_text, p3_text, p4_text, p5_text)

plot_list <- tribble(
  ~plot, ~code_text, ~description,
  p1, p1_text, "Start with the initial plot…",
  p2, p2_text, "…use a logarithmic x-axis…",
  p3, p3_text, "…change the color palette…",
  p4, p4_text, "…change the theme…",
  p5, p5_text, "…and change the default labels"
)

We want the overall tabset to look something like this:

::: {.panel-tabset}

### Step 1

Short description

```{r}
plot_list[[1]]
```

```r
# Code here
```

### Step 2

Short description

```{r}
plot_list[[2]]
```

```r
# Code here
```

### …and so on

:::

…so next we’ll generate markdown for each panel:

panels <- map_chr(
  seq_len(nrow(plot_list)), 
  \(i) {
    glue("
    ### Step <<i>>

    `r plot_list$description[[<<i>>]]`

    ```{r}
    #| label: plot-panel-<<i>>
    #| echo: false
    plot_list$plot[[<<i>>]]
    ```

    ```r
    `r plot_list$code_text[[<<i>>]]`
    ```", .open = "<<", .close = ">>")
  }
)

Finally we’ll wrap Quarto’s special markdown syntax for tabsets around these panels and then include the combined text as an inline chunk in the document:

Let's slowly build up the plot:

::: {.panel-tabset}

`r knitr::knit(text = paste0(panels, collapse = "\n\n"))`

:::

Let’s slowly build up the plot:

Step 1

Start with the initial plot…

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

Step 2

…use a logarithmic x-axis…

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar_format(accuracy = 1))

Step 3

…change the color palette…

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar_format(accuracy = 1)) +
  scale_color_viridis_d(option = "plasma", end = 0.9)

Step 4

…change the theme…

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar_format(accuracy = 1)) +
  scale_color_viridis_d(option = "plasma", end = 0.9) +
  theme_minimal()

Step 5

…and change the default labels

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar_format(accuracy = 1)) +
  scale_color_viridis_d(option = "plasma", end = 0.9) +
  theme_minimal() +
  labs(x = "GDP per capita", y = "Life expectancy", color = "Continent")

Condensed example of continent-level mini reports

Finally, let’s look at one more example that’s similar to what I used for the Idaho election results website, making a sort of miniature report for each continent. This time, instead of creating a tabset panel for each continent, we’ll make a whole markdown section for each continent, with a tabset panel included in each.

Each continent section will look something like this:

### Continent name

::: {.callout-note icon="false" title="X countries" collapse="true"}
Comma-separated list of countries
:::

::: {.panel-tabset}
#### Details

```{r}
#| label: table-summary-continent
#| echo: false
# A table showing average GDP per capita and average life expectancy
```

#### Plot

```{r}
#| label: plot-summary-continent
#| echo: false
# A plot showing the relationship between GDP per capita and life expectancy
```
:::

As before, we’ll translate this template into a glue() string, feed some data into it, and generate a bunch of R chunks.

Child documents

This template is getting gnarly with so many moving parts and so many common {glue} delimiters like {}s and []s and <>s. An alternative approach is to put the template in a separate child Quarto document and then pre-render it with knitr::knit_child(), which behaves just like the inline knitr::knit(text = BLAH) approach we’ve been using. Quarto has an official example of how to do it here: example and code.

First, we’ll make a data frame with all the different pieces we want to include, using the same group_by(continent) |> nest() approach as before:

continent_report_items <- gapminder_2007 |> 
  group_by(continent) |> 
  nest() |> 
  ungroup() |> 
  mutate(country_list = map_chr(data, \(x) knitr::combine_words(x$country))) |> 
  mutate(n_countries = map_int(data, \(x) nrow(x))) |> 
  mutate(summary_details = map(data, \(x) {
    x |> 
      summarize(
        `Average GDP per capita` = mean(gdpPercap),
        `Average life expectancy` = mean(lifeExp),
        `Average population` = mean(pop)
      ) |> 
      pivot_longer(everything(), names_to = "Statistic", values_to = "Value") |> 
      mutate(Value = scales::comma_format()(Value))
  })) |> 
  # Use map2() so each plot title gets its own continent name instead of the
  # whole continent column
  mutate(plot = map2(data, continent, \(x, continent) {
    x |> 
      ggplot(aes(x = gdpPercap, y = lifeExp)) +
      geom_point() +
      scale_x_log10(labels = scales::label_dollar(accuracy = 1)) +
      labs(title = glue("Health and wealth in {continent}"))
  }))

Next we’ll make a function that generates the markdown output:

build_continent_report <- function(i) {
  name_for_labels <- janitor::make_clean_names(continent_report_items$continent[[i]])

  # Quarto and RStudio and Positron all really struggle with syntax highlighting
  # and parsing when there are multiple ```s inside a string, so we can make
  # life easier by splitting the output into a few parts here, ensuring that
  # there's a maximum of one set of triple backticks
  output_first_part <- glue('
  ### `r continent_report_items$continent[[<<i>>]]`

  ::: {.callout-note icon="false" title="`r continent_report_items$n_countries[[<<i>>]]` countries" collapse="true"}
  `r continent_report_items$country_list[[<<i>>]]`
  :::', .open = "<<", .close = ">>")

  output_panel_details <- glue('
  #### Details

  ```{r}
  #| label: table-summary-<<name_for_labels>>
  #| echo: false
  continent_report_items$summary_details[[<<i>>]] |> 
    knitr::kable()
  ```', .open = "<<", .close = ">>")

  output_panel_plot <- glue('
  #### Plot

  ```{r}
  #| label: plot-summary-<<name_for_labels>>
  #| echo: false
  continent_report_items$plot[[<<i>>]]
  ```', .open = "<<", .close = ">>")

  # Combine all the pieces
  output <- glue('
  {output_first_part}

  ::: {{.panel-tabset}}
  {output_panel_details}

  {output_panel_plot}
  :::
  ')

  output
}

Finally, we’ll loop through each row in continent_report_items and generate the markdown report using the template…

continent_reports <- map_chr(
  seq_len(nrow(continent_report_items)),
  \(i) build_continent_report(i)
)

…and include all the generated markdown in an inline chunk:

## Continent reports

Check out all these automatically generated continent reports!

`r knitr::knit(text = paste0(continent_reports, collapse = "\n\n"))`

That single inline chunk automatically generates dozens of inline and block chunks of R code before the full document goes through Quarto, which means all this output gets included in the final rendered version:

Generated output

Continent reports

Check out all these automatically generated continent reports!

Asia

33 countries

Afghanistan, Bahrain, Bangladesh, Cambodia, China, Hong Kong, China, India, Indonesia, Iran, Iraq, Israel, Japan, Jordan, Korea, Dem. Rep., Korea, Rep., Kuwait, Lebanon, Malaysia, Mongolia, Myanmar, Nepal, Oman, Pakistan, Philippines, Saudi Arabia, Singapore, Sri Lanka, Syria, Taiwan, Thailand, Vietnam, West Bank and Gaza, and Yemen, Rep.

Details

Plot

Statistic Value

Average GDP per capita 12,473

Average life expectancy 71

Average population 115,513,752

Europe

30 countries

Albania, Austria, Belgium, Bosnia and Herzegovina, Bulgaria, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Montenegro, Netherlands, Norway, Poland, Portugal, Romania, Serbia, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Turkey, and United Kingdom

Details

Plot

Statistic Value

Average GDP per capita 25,054

Average life expectancy 78

Average population 19,536,618

Africa

52 countries

Algeria, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d’Ivoire, Djibouti, Egypt, Equatorial Guinea, Eritrea, Ethiopia, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Kenya, Lesotho, Liberia, Libya, Madagascar, Malawi, Mali, Mauritania, Mauritius, Morocco, Mozambique, Namibia, Niger, Nigeria, Reunion, Rwanda, Sao Tome and Principe, Senegal, Sierra Leone, Somalia, South Africa, Sudan, Swaziland, Tanzania, Togo, Tunisia, Uganda, Zambia, and Zimbabwe

Details

Plot

Statistic Value

Average GDP per capita 3,089

Average life expectancy 55

Average population 17,875,763

Americas

25 countries

Argentina, Bolivia, Brazil, Canada, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, El Salvador, Guatemala, Haiti, Honduras, Jamaica, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, Trinidad and Tobago, United States, Uruguay, and Venezuela

Details

Plot

Statistic Value

Average GDP per capita 11,003

Average life expectancy 74

Average population 35,954,847

Oceania

2 countries

Australia and New Zealand

Details

Plot

Statistic Value

Average GDP per capita 29,810

Average life expectancy 81

Average population 12,274,974

↑ that’s how we were able to generate race-specific output for 100+ individual contests for the 2024 Idaho elections. It looks messy at first, but it’s a billion times easier to work with than copying/pasting markdown text and manually modifying the code in the chunks.
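The core trick boils down to a few lines. Here's a minimal standalone sketch, assuming {glue} and {knitr} are installed. The `fruits` data frame and the `build_fruit_report()` helper are made up for illustration and aren't part of the gapminder report:

```r
library(glue)
library(knitr)

# Hypothetical toy data (not part of the gapminder example)
fruits <- data.frame(name = c("Apple", "Banana"), n = c(3, 5))

# Build a markdown template containing inline R code for row i
build_fruit_report <- function(i) {
  as.character(glue(
    "### `r fruits$name[[<<i>>]]`\n\n",
    "There are `r fruits$n[[<<i>>]] * 2` halves.",
    .open = "<<", .close = ">>"
  ))
}

fruit_reports <- vapply(
  seq_len(nrow(fruits)), build_fruit_report, character(1)
)

# knit() runs the generated inline code and returns plain markdown
rendered <- knit(text = paste0(fruit_reports, collapse = "\n\n"), quiet = TRUE)
cat(rendered)
```

The generated text only contains R code as strings. Nothing gets evaluated until `knit()` runs, which is why the templates can refer to objects by index.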

Citation

BibTeX citation:

@online{heiss2024,
  author = {Heiss, Andrew},
  title = {Guide to Generating and Rendering Computational Markdown
    Content Programmatically with {Quarto}},
  date = {2024-11-04},
  url = {https://www.andrewheiss.com/blog/2024/11/04/render-generated-r-chunks-quarto/},
  doi = {10.59350/pa44j-cc302},
  langid = {en}
}

For attribution, please cite this work as:

Heiss, Andrew. 2024. “Guide to Generating and Rendering Computational Markdown Content Programmatically with Quarto.” November 4, 2024. https://doi.org/10.59350/pa44j-cc302.

Fun with Positron

Andrew Heiss — Mon, 08 Jul 2024 04:00:00 GMT

At the end of June 2024, Posit released a beta version of its next-generation IDE for data science: Positron. This follows Posit’s general vision for language-agnostic data analysis software: RStudio PBC renamed itself to Posit PBC in 2022 to help move away from a pure R focus, and Quarto is the pan-lingual successor to R Markdown. Having the name of the main programming language in the title of things is out—providing more general tools is in.

Positron is essentially a specialized version of Microsoft’s Visual Studio Code, and is a fork of the underlying Code - OSS that powers VS Code. I’m super excited about this—in my own work, I use RStudio for most things R-related and VS Code for everything else (Stan, Python, HTML, CSS, Lua, LaTeX, Typst, etc.). VS Code is phenomenal and I love using it. It’s the best way to edit files on a remote server. It’s the best way to interact with Docker containers and Docker Compose. GitHub Copilot Chat is fantastic.

But for me, it’s never quite been a replacement for RStudio. Every couple months, I play around with trying to use VS Code for R work full time, but the constellation of VS Code R extensions (like the R extension and Radian for the terminal) and general R support has never been what I want, and I always end up going back to RStudio. Which is fine! I adore RStudio too and have been using it since it first came out in beta in February 2011 (13 years!).

Positron brings pretty much all the little R-related things that I love from RStudio and have missed in Visual Studio Code. The regular collection of VS Code’s R extensions and add-ons is no longer necessary, since Posit has created a custom R kernel—Ark—for any text editor or IDE with Jupyter support. It’s still a beta product and a little rough around the edges, but I’ve found that it really is the perfect blend of the best parts of RStudio and VS Code.

Below, following the example of Marc Dotson and Christopher Kenny, I want to highlight some of the neat new things Positron can do and share some of the settings, extensions, and other customizations I’ve been using for the past couple weeks.

Some cool new things

Positron brings RStudio’s best features to other languages like Python, like line-by-line code execution:

Line-by-line execution in Python
Line-by-line execution in R

…and the variables panel (equivalent to the Environment panel in RStudio):

Variables panel in Python
Variables panel in R


You can also switch between different R and Python installations and versions. If you have rig installed, you can switch between different R versions, and Positron scans your computer at startup to find all the different Python virtual environments you have. I think eventually that little menu might also have better support for {renv} too. Here I can switch between R 4.4.0 and 4.3.3, as well as a bunch of random Python virtual environments and installations (lol python installations are the worst):

Version switcher

Plus, since it’s just a fancy version of VS Code, Positron supports pretty much everything VS Code can do, including making complex layouts. Use the little menu in the top right corner to set up your workspace however you want:

Customize layout

For instance, here’s an example of a fully armed and operational layout I’ve been using for one project:

Full ultrawide workspace

My settings

How to change settings in Positron / VS Code

Positron (and VS Code in general) stores all its settings in a JSON file named settings.json that’s stored somewhere on your computer. On macOS it’s in ~/Library/Application Support/Positron/User/settings.json (see here for other operating systems). But you don’t need to ever remember that!

When you open Positron’s settings (with ⌘, on a Mac; Ctrl+, on Windows), Positron provides a nice frontend for searching, managing, and changing settings so you don’t need to edit raw JSON if you don’t want to.

If you click on the document button in the top right corner, you can open the actual settings.json file in the editor and make changes there. This is the easiest way to share settings with other people (like in this blog post) or with yourself (you can commit settings.json to a git repository, for instance).

Positron settings page
Configuring and customizing Positron involved basically copying most of my settings from VS Code’s settings.json into Positron’s settings.json. Here’s everything I have set up, with comments explaining stuff. A few things to note in particular:

I set rstudio.keymap.enable to true to enable most of RStudio’s R-related keyboard shortcuts (like ⌘⌥I for a new chunk, ⌥- to insert <-, etc.).

I’m using GitHub’s Monaspace font because it looks neat and it has excellent font ligatures. I’ve enabled a bunch of different stylistic sets for the ligatures.

I’m a big fan of the Monokai color theme and use it in RStudio and VS Code. It’s easy enough to set in Positron too, but for mysterious unknown reasons, it uses colors differently and is overly aggressive in what gets colorized. Compare this ggplot code across three different Monokais (VS Code, Positron, and RStudio). The Positron version is incredibly green and pink, while VS Code and RStudio use color more sparingly.

Monokai highlighting in VS Code, Positron, and RStudio + Positron Dark

So for now, I’m using the Positron Dark theme instead, which does the best job of highlighting the things that RStudio did. It’s nice enough.

Here’s my settings.json file. Adapt it however you want. All these settings are accessible in the GUI too.

GUI for enabling RStudio keymapping

There are some extension-specific options at the bottom that I’ll explain below too.

settings.json

{
  // Positron-specific settings
  // -------------------------------------------------------------------------
  "rstudio.keymap.enable": true,
  "python.defaultInterpreterPath": "/opt/homebrew/bin/python",

  // Editor settings
  // -------------------------------------------------------------------------
  // Fonts
  // Use GitHub's Monaspace (https://github.com/githubnext/monaspace) and enable ligatures
  "editor.fontFamily": "'Monaspace Argon Var'",
  "editor.fontSize": 12.5,
  "editor.fontLigatures": "'ss01', 'ss02', 'ss03', 'ss04', 'ss05', 'ss06', 'ss07', 'ss08', 'calt', 'dlig', 'liga'",

  // Theme
  // Monokai would be nice, but it has issues in Positron
  // "workbench.colorTheme": "Monokai",
  "workbench.colorTheme": "Default Dark Modern",

  // Use nicer icons
  "workbench.productIconTheme": "fluent-icons",
  "workbench.iconTheme": "material-icon-theme",

  // Highlight modified/unsaved tabs
  "workbench.editor.highlightModifiedTabs": true,

  // Add some rulers
  "editor.rulers": [
    80,
    100
  ],

  // Indent with two spaces, but only for R
  "[r]": {
    "editor.tabSize": 2
  },

  // Nicer handling of end-of-document newlines, via
  // https://rfdonnelly.github.io/posts/sane-vscode-whitespace-settings/
  "files.insertFinalNewline": true,
  "editor.renderFinalNewline": "dimmed",
  "editor.renderWhitespace": "trailing",
  "files.trimFinalNewlines": true,
  "files.trimTrailingWhitespace": true,

  // Various editor settings
  "editor.formatOnPaste": true,
  "editor.detectIndentation": false,
  "editor.showFoldingControls": "always",
  "window.newWindowDimensions": "inherit",
  "editor.scrollBeyondLastLine": false,
  "window.title": "${activeEditorFull}${separator}${rootName}",
  "editor.tabSize": 4,
  "editor.wordWrap": "on",
  "editor.multiCursorModifier": "ctrlCmd",
  "editor.snippetSuggestions": "top",

  // Hide things from the global search menu and watcher
  "files.exclude": {
    "**/.Rhistory": true,
    "**/.Rproj": true,
    "**/.Rproj.user": true,
    "**/renv/library": true,
    "**/renv/local": true,
    "**/renv/staging": true
  },
  "files.watcherExclude": {
    "**/.Rproj/*": true,
    "**/renv/library": true,
    "**/renv/local": true,
    "**/renv/staging": true
  },

  // Sign git commits
  "git.enableCommitSigning": true,

  // Extension-specific settings
  // -------------------------------------------------------------------------
  // Markdown linting settings (idk if this stuff even works with Quarto though)
  "markdownlint.config": {
    "default": true,
    "MD012": { "maximum": 2 },
    "MD025": false,
    "MD041": false
  },

  // Wrap at 80 columns with the "Rewrap" extension
  "rewrap.wrappingColumn": 80,

  // Hacky "Open Remote - SSH" settings
  "remote.SSH.serverDownloadUrlTemplate": "https://github.com/gitpod-io/openvscode-server/releases/download/openvscode-server-v${version}/openvscode-server-v${version}-${os}-${arch}.tar.gz",
  "remote.SSH.experimental.serverBinaryName": "openvscode-server",

  // Don't phone home for the "YAML" extension
  "redhat.telemetry.enabled": false,
}

My keyboard shortcuts

How to change keyboard shortcuts in Positron / VS Code

Changing keyboard shortcuts is just like changing settings. All the settings are stored in a JSON file (keybindings.json) located in a special folder on your computer, but you don’t have to work with raw JSON if you don’t want to.

The easiest way to get to the keyboard shortcut settings page is to open the Command Palette (⌘⇧P on macOS; ctrl + shift + p on Windows) and search for “Open Keyboard Shortcuts”:

This will give you a nice page for changing different settings. There are hundreds of possible shortcuts, but there’s a nice filtering system you can use to narrow things down.

If you click on the little document icon at the top, it will open the actual JSON file, just like with settings.json:

Accessing keyboard shortcuts from the Command Palette
Keyboard shortcut editor
Keyboard shortcuts as JSON
Enabling Positron’s RStudio Keymap option with rstudio.keymap.enable takes care of like 90% of my keyboard customization needs. Years ago when I first switched to VS Code, I changed several of RStudio’s keyboard shortcuts to match VS Code’s, like ⌘/ for toggling commented code instead of RStudio’s default ⌘⇧C. Positron uses ⌘/ by default for comment toggling too, but when you enable the RStudio Keymap option, that gets overridden with ⌘⇧C, so I disable that.

RStudio also uses ⌘D for deleting a line, while VS Code uses it for adding text to a selection (i.e. if I select the word “the” in this document and then press ⌘D a bunch of times, it’ll add all those “the”s to the selection). The RStudio Keymap option adds ⌘D to delete the current line, so I disable that shortcut too to bring things back in line with standard VS Code.

Finally, I use iTerm2 for macOS for my systemwide terminal, and I have it configured with a global hotkey ^` so I can access the terminal from everywhere. This conflicts with VS Code’s and Positron’s terminal toggling shortcut, which is the same, so I change it to be ^⇧`.

Here’s my keybindings.json file. Like with settings.json, these are also accessible in the GUI.

Custom keyboard shortcuts

keybindings.json

[
  {
    "key": "ctrl+alt+`",
    "command": "workbench.action.terminal.new",
    "when": "terminalProcessSupported || terminalWebExtensionContributedProfile"
  },
  {
    "key": "ctrl+shift+`",
    "command": "-workbench.action.terminal.new",
    "when": "terminalProcessSupported || terminalWebExtensionContributedProfile"
  },
  {
    "key": "ctrl+shift+`",
    "command": "workbench.action.terminal.toggleTerminal",
    "when": "terminal.active"
  },
  {
    "key": "ctrl+`",
    "command": "-workbench.action.terminal.toggleTerminal",
    "when": "terminal.active"
  },
  {
    "key": "shift+cmd+c",
    "command": "-editor.action.commentLine",
    "when": "config.rstudio.keymap.enable && editorTextFocus"
  },
  {
    "key": "alt+cmd+q",
    "command": "rewrap.rewrapComment",
    "when": "editorTextFocus"
  },
  {
    "key": "alt+q",
    "command": "-rewrap.rewrapComment",
    "when": "editorTextFocus"
  },
  {
    "key": "cmd+d",
    "command": "-editor.action.deleteLines",
    "when": "config.rstudio.keymap.enable && editorTextFocus"
  }
]

My extensions

How to install extensions in Positron / VS Code

Installing extensions in Positron / VS Code is super straightforward (see here). Click on the Extensions icon in the main Activity Bar, search for an extension, and click on “Install”. You can also disable or uninstall existing extensions from here.

Extension page for Stan
One of the best things about Positron is that it has access to most of VS Code’s extensions. Positron is not allowed to access Microsoft’s Visual Studio Extension Marketplace, but it can access (and is a major sponsor of) the alternative Open VSX Registry. With the exception of Microsoft’s extensions like GitHub Copilot, Dev Containers, and Remote - SSH, Open VSX had pretty much all the extensions that I already regularly use in VS Code.

The only minor VS Code extension I normally use that I couldn’t install in Positron was Stata Enhanced (not that I even ever use Stata—I don’t have it installed on my computer and don’t have a license, but it’s nice to be able to open .do files and see syntax highlighting). Stata Enhanced isn’t listed at Open VSX, but I’ve opened an issue requesting that it get listed.

Here’s what I use:

Managing other environments

Docker: Manage Docker containers and volumes; right click on docker-compose.yml files to spin them up and shut them down; syntax highlighting for Dockerfiles and Docker Compose

Open Remote - SSH: Connect to remote servers with SSH. This is bundled with Positron and there’s no need to install anything.

Text editing

Rewrap: Automatically add line breaks in long comments or text (I have it set to wrap at 80 characters using ⌘⌥Q)

Better Comments: Add special syntax highlighting for some types of comments like TODO, ?, !, and so on

Shebang Snippets: Provides snippets for adding shebang directives (e.g. type #!python to get #!/usr/bin/env python)

Viewers and syntaxes

Excel Viewer: View .xlsx files

vscode-pdf: View PDFs

Rainbow CSV: Does neat syntax highlighting for CSV files (highlighting each column with specific colors)

Stan: Syntax highlighting for Stan

YAML: Syntax highlighting for YAML

Lua: Syntax highlighting for Lua

markdownlint: Linting and style suggestions for Markdown

Theme stuff

Material Icon Theme: Customize the icons associated with specific file types in the file explorer

Fluent Icons: Customize the icons in the general Positron app (primarily the icons in the Activity Bar, like Explorer, Search, Source Control, etc.)

Remote connections with SSH

One of the best features of VS Code is its ability to connect to remote servers through SSH, but because that’s enabled with a special closed source Microsoft extension, it doesn’t work in Positron.

The Open Remote - SSH extension replicates Microsoft’s remote SSH extension, and it’s available at Open VSX. ~~However, it doesn’t work with Positron immediately—you’ll get an error when connecting.~~ This now works and there’s no need to install anything! (See this for a previous partial workaround.)

August 2024 update!

As of August 2024, Positron now bundles an SSH extension that Just Works™. If you have R or Python installed on a remote server, you can connect to it and run code remotely and it’s all great and wonderful now.

Things I still wish Positron could do

Positron is still in beta and is undergoing rapid development, and that’s totally fine. Even though it’s not a finished product yet, it works really really well.

There are still some things I wish it could do though. Some of these will eventually be addressed; some can’t because of Microsoft.

Packages panel: I love RStudio’s Packages panel. It’s so helpful for seeing which packages are currently installed, which versions are installed, updating existing packages, and installing new ones.

RStudio’s Packages panel

Nothing like this exists for Positron right now, but there’s discussion about how to build something like it (that would also work for Python).

Plot dimensions: (This will hopefully be addressed someday.) In RStudio, when working with Quarto and R Markdown documents, inline images use the dimensions that you set in the chunk options, which makes it really easy to tinker with plot dimensions (i.e. changing from fig-width: 2.5 to fig-width: 2.75 to make sure labels fit in the plot area). Currently, plots in Positron show up in the plots panel and use whatever dimensions that panel is set to use, either by manually resizing it or by using a dropdown menu with specific sizes:

Positron’s plot panel

It would be cool if Positron’s plot panel could pick up the dimensions specified in a Quarto document and auto-resize to match. For now, I’ve just been using the “Custom Size” option. If I want to preview an image that’s 5 inches wide and 3.75 inches tall, I convert the ratio of width/height to pixels. It’s not exact—there are issues with different DPIs and retina screens—but it at least shows the correct proportion.

Custom sizes in Positron’s plot panel
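That inches-to-pixels conversion is just multiplication by a resolution. Here's a rough sketch of it. The `inches_to_pixels()` helper is hypothetical, and the default of 96 DPI is an assumption that won't match every screen:

```r
# Convert plot dimensions in inches to pixels at a given resolution.
# Hypothetical helper; 96 DPI is an assumed default, not a Positron setting.
inches_to_pixels <- function(width_in, height_in, dpi = 96) {
  c(width = round(width_in * dpi), height = round(height_in * dpi))
}

# A 5 × 3.75 inch plot keeps its 4:3 proportion in pixels
px <- inches_to_pixels(5, 3.75)
px
```

Whatever DPI you pick, the width/height ratio stays the same, which is the part that matters for previewing how labels fit.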

Remote editing and execution with Open Remote - SSH: It would be incredible if (1) it were a lot easier to install Open Remote - SSH, and (2) it were possible to run code on remote servers. I think they’re working on supporting this.

Similar to this, but less important to me because I don’t use Docker containers this way, VS Code can work with Docker containers with the Dev Containers extension, similar to SSH, using Docker environments to run R/Python locally. They might be working on supporting this some day.

GitHub Copilot Chat: Being able to chat with GitHub Copilot in VS Code is fantastic and it’s like the only LLM thing I use. But it only works through a closed source extension by Microsoft and probably won’t ever work outside of VS Code proper.

Citation
BibTeX citation:
@online{heiss2024,
  author = {Heiss, Andrew},
  title = {Fun with {Positron}},
  date = {2024-07-08},
  url = {https://www.andrewheiss.com/blog/2024/07/08/fun-with-positron/},
  doi = {10.59350/zs7da-17c67},
  langid = {en}
}
For attribution, please cite this work as:
Heiss, Andrew. 2024. “Fun with Positron.” July 8, 2024. https://doi.org/10.59350/zs7da-17c67.

Calculating the proportion of US state borders that are coastlines

Andrew Heiss — Wed, 08 May 2024 04:00:00 GMT
A few days ago, my wife, a bunch of my kids, and I were huddled around a big wall map of the United States, joking about the relative unimportance of Rhode Island, the smallest state in the US. It’s one of the states I never ever think about:

Tweet by me from May 6, 2020

…and it’s just so small.

Amid the joking, my wife came to Rhode Island’s defense by declaring that even though it’s so small, it has one of the highest proportions of coastline to land borders. We all gave it a metaphorical gold star for being so maritime-y and moved on with our days.

But as I thought about it later, I got curious about how much of Rhode Island’s border really is coastline and how that proportion compares to other states. New England in general has lots of inlets and islands; North Carolina has the complex Outer Banks; Louisiana has the Mississippi Delta; Michigan is split into two parts and surrounded by the Great Lakes; Florida is Florida. Lots of other states have lots of coastline.

Using R, the extremely powerful {sf} package for working with geospatial data, and some high quality public domain geographic data, we can find the actual answers about coastline proportions. Spoilers: Rhode Island does really well, as expected.

But doing this is a lot more complicated than you might think, both for technical reasons and for philosophical reasons.

Let’s explore this data and make some pretty maps!

library(tidyverse)
library(tigris)
library(sf)
library(rnaturalearth)
library(patchwork)
library(gt)

clrs <- rcartocolor::carto_pal(12, "Prism")
clr_ocean <- colorspace::lighten("#88CCEE", 0.7)

# Custom ggplot theme to make pretty plots
# Get the font at https://fonts.google.com/specimen/Overpass
theme_map <- function() {
  theme_void(base_family = "Overpass Light") +
    theme(
      plot.title = element_text(family = "Overpass", face = "bold", hjust = 0.5),
      plot.subtitle = element_text(family = "Overpass", face = "plain", hjust = 0.5)
    )
}

update_geom_defaults("text", list(family = "Overpass"))

Use US Census data on states and coastlines

At first glance, calculating the proportion of coastline borders in states feels fairly straightforward. We take state boundaries, find where they intersect with coastline boundaries, extract those overlapping sections, and voilà—we’re done.

The US Census Bureau even has shapefiles ready to use, like US state borders and the national coastline for 2023, and the {tigris} package makes it really easy to load that data directly into R.

First, let’s grab US state data at medium resolution (1:5 million) and calculate the length of each state’s border. To make calculations easier, we’ll change the projection to Albers, which measures distances in meters instead of the default NAD 83 decimal degree system.

census_states <- states(
  cb = TRUE, resolution = "5m", year = 2023,
  progress_bar = FALSE, keep_zipped_shapefile = TRUE
) |>
  st_transform(crs = st_crs("ESRI:102003")) |>  # Albers
  mutate(
    border_length = st_perimeter(geometry),
    border_length_miles = units::set_units(border_length, "miles")
  )

census_states |>
  select(NAME, border_length, border_length_miles)
## Simple feature collection with 56 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -10430000 ymin: -1685000 xmax: 3408000 ymax: 5141000
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
## First 10 features:
##            NAME border_length border_length_miles                       geometry
## 1    New Mexico   2389194 [m]      1484.6 [miles] MULTIPOLYGON (((-1231344 -5...
## 2   Puerto Rico    683648 [m]       424.8 [miles] MULTIPOLYGON (((3306526 -15...
## 3         Texas   6218137 [m]      3863.8 [miles] MULTIPOLYGON (((-1e+06 -570...
## 4      Kentucky   2097216 [m]      1303.1 [miles] MULTIPOLYGON (((584560 -886...
## 5          Ohio   1614692 [m]      1003.3 [miles] MULTIPOLYGON (((1094061 536...
## 6       Georgia   1950980 [m]      1212.3 [miles] MULTIPOLYGON (((939223 -230...
## 7      Arkansas   2117255 [m]      1315.6 [miles] MULTIPOLYGON (((122656 -111...
## 8        Oregon   2301152 [m]      1429.9 [miles] MULTIPOLYGON (((-2285910 94...
## 9  Pennsylvania   1579842 [m]       981.7 [miles] MULTIPOLYGON (((1287712 486...
## 10     Missouri   2357068 [m]      1464.6 [miles] MULTIPOLYGON (((19009 34499...
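To see why a meter-based projection makes these numbers easy to reason about, here's a small sketch with a made-up 1 km × 1 km square "state". It measures the perimeter as the length of the polygon's boundary with st_boundary() and st_length(), which is a stand-in for the st_perimeter() call above; the square itself is purely hypothetical:

```r
library(sf)

# Hypothetical 1 km × 1 km square in the same Albers projection
square <- st_polygon(list(rbind(
  c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
))) |>
  st_sfc(crs = st_crs("ESRI:102003"))

# The perimeter is the length of the polygon's outline, in meters
perimeter <- st_length(st_boundary(square))
perimeter

# Convert to miles, like border_length_miles above
units::set_units(perimeter, "miles")
```

Because the coordinates are already in meters, the four 1,000-meter sides add up to a 4,000-meter perimeter with no extra conversion.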

Next we’ll grab coastline data and also convert it to the meter-based Albers projection:

census_coastline <- coastline(
  year = 2023, progress_bar = FALSE, keep_zipped_shapefile = TRUE
) |>
  st_transform(crs = st_crs("ESRI:102003"))

census_coastline
## Simple feature collection with 4236 features and 2 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -10430000 ymin: -1685000 xmax: 3408000 ymax: 5140000
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
## First 10 features:
##         NAME MTFCC                       geometry
## 1  Atlántico L4150 LINESTRING (3232962 -158263...
## 2  Atlántico L4150 LINESTRING (3282570 -157424...
## 3  Atlántico L4150 LINESTRING (3278311 -157345...
## 4  Atlántico L4150 LINESTRING (3283967 -157426...
## 5  Atlántico L4150 LINESTRING (3282914 -157442...
## 6  Atlántico L4150 LINESTRING (3282240 -157431...
## 7  Atlántico L4150 LINESTRING (3280919 -157406...
## 8  Atlántico L4150 LINESTRING (3276748 -157350...
## 9  Atlántico L4150 LINESTRING (3276791 -157363...
## 10 Atlántico L4150 LINESTRING (3286366 -157464...

This coastline data isn’t state-based—instead, each row represents a segment of the US coastline. Here’s what it looks like when plotted:

ggplot() +
  geom_sf(data = census_coastline, linewidth = 0.1) +
  labs(title = "US Census coastline data") +
  coord_sf(crs = st_crs("ESRI:102003")) +
  theme_map()

To find where the two maps intersect, we can use st_intersection(), and then we can calculate the length of each combined segment with st_length():

census_combined <- census_states |>
  st_intersection(census_coastline) |>
  mutate(coastline_length = st_length(geometry))

census_combined |>
  select(NAME, border_length, coastline_length, geometry)
## Simple feature collection with 1276 features and 3 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -10430000 ymin: -1685000 xmax: 3408000 ymax: 5140000
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
## First 10 features:
##            NAME border_length coastline_length                       geometry
## 2   Puerto Rico    683648 [m]      119.450 [m] LINESTRING (3282570 -157424...
## 2.1 Puerto Rico    683648 [m]       12.122 [m] LINESTRING (3282914 -157442...
## 2.2 Puerto Rico    683648 [m]      722.088 [m] LINESTRING (3282240 -157431...
## 2.3 Puerto Rico    683648 [m]        5.332 [m] LINESTRING (3280919 -157406...
## 2.4 Puerto Rico    683648 [m]     2014.901 [m] LINESTRING (3279028 -157391...
## 2.5 Puerto Rico    683648 [m]     2446.737 [m] LINESTRING (3081263 -163791...
## 2.6 Puerto Rico    683648 [m]    12209.219 [m] MULTILINESTRING ((3306976 -...
## 2.7 Puerto Rico    683648 [m]     2804.085 [m] MULTILINESTRING ((3178034 -...
## 2.8 Puerto Rico    683648 [m]     7978.399 [m] MULTILINESTRING ((3230856 -...
## 2.9 Puerto Rico    683648 [m]     6004.139 [m] MULTILINESTRING ((3172336 -...

Some of these segments are thousands of meters long; some are only a few meters. We can do some grouping and summarizing to collapse these into single values for each state. border_length is a state-level variable, not a border-segment-level variable, so it’s repeated in each row of the combined dataset. We only need to keep one of those values—here I keep the max, but min would work too, since they’re all the same.

coastline_length_by_state <- census_combined |>
  st_drop_geometry() |>  # Stop worrying about geographic stuff
  group_by(NAME) |>
  summarize(
    total_coastline_length = sum(coastline_length),
    total_perimeter = max(border_length),
    prop_coastline = as.numeric(total_coastline_length / total_perimeter)
  )

coastline_length_by_state
## # A tibble: 35 × 4
##    NAME                                         total_coastline_length total_perimeter prop_coastline
##    <chr>                                                           [m]             [m]          <dbl>
##  1 Alabama                                                     192355.        1907670.         0.101
##  2 Alaska                                                    19223761.       29671399.         0.648
##  3 American Samoa                                              175827.         172059.         1.02
##  4 California                                                 1129738.        4191420.         0.270
##  5 Commonwealth of the Northern Mariana Islands                261865.         327378.         0.800
##  6 Connecticut                                                  77127.         574613.         0.134
##  7 Delaware                                                    110370.         433573.         0.255
##  8 Florida                                                    1842297.        3795469.         0.485
##  9 Georgia                                                     114280.        1950980.         0.0586
## 10 Guam                                                        120734.         133668.         0.903
## # ℹ 25 more rows
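The summarizing logic is easier to check on a tiny example. This sketch uses made-up segment data (two imaginary states, "A" and "B", with plain numbers instead of {units} values) and follows the same sum-then-divide pattern:

```r
library(dplyr)

# Hypothetical segment-level data: one row per coastline segment, with the
# state's full perimeter repeated on every row
segments <- tibble(
  NAME = c("A", "A", "B"),
  coastline_length = c(100, 150, 40),
  border_length = c(500, 500, 400)
)

toy_summary <- segments |>
  group_by(NAME) |>
  summarize(
    total_coastline_length = sum(coastline_length),
    # Every row in a group has the same perimeter, so max() just picks it
    total_perimeter = max(border_length),
    prop_coastline = total_coastline_length / total_perimeter
  )

toy_summary
```

State A has 250 of its 500 units of border on the coast, so its proportion is 0.5; state B has 40 of 400, or 0.1.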

At first glance, this looks fine. We have each state’s total border length, coastline length, and a proportion. Neat.

But some of these proportions look wrong, like Hawaiʻi:

coastline_length_by_state |>
  filter(NAME %in% c("Hawaii", "North Carolina"))
## # A tibble: 2 × 4
##   NAME           total_coastline_length total_perimeter prop_coastline
##   <chr>                             [m]             [m]          <dbl>
## 1 Hawaii                        712716.        1413312.         0.504
## 2 North Carolina                248590.        3498277.         0.0711

According to this, only 50% of Hawaiʻi’s borders are on the coast. That’s obviously wrong—that state shares no borders with other states and it’s in the middle of the Pacific Ocean.

If we plot the two datasets, we can see what’s going on. First, the resolutions of the state data and coastline data don’t quite match. Check out Oʻahu here—the purple coastline crosses some bays and misses some of the landmass:

# Extract the state boundaries
census_hi <- census_states |>
  filter(NAME == "Hawaii")

# Extract the bounding box around the state so we can zoom in on the coastline map
bbox_hi <- census_hi |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

ggplot() +
  geom_sf(data = census_hi, linewidth = 0.1, fill = clrs[5]) +
  geom_sf(data = census_coastline, aes(color = "US Census coastline data")) +
  annotate(geom = "text", x = I(0.425), y = I(0.89), label = "Oʻahu") +
  annotate(
    geom = "rect",
    xmin = I(0.35), xmax = I(0.5), ymin = I(0.63), ymax = I(0.85),
    color = clrs[12], fill = NA, linetype = "21"
  ) +
  scale_color_manual(values = c(clrs[11])) +
  labs(title = "Hawaiʻi", color = NULL) +
  coord_sf(
    xlim = bbox_hi[c(1, 3)], ylim = bbox_hi[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(legend.position = "top")

This mismatch between datasets is even more obvious if we look at a state like North Carolina. The actual landmass in the Outer Banks along the east coast is complex, with all sorts of inlets and a big long barrier island, but the Census simplifies it down substantially (with good reason(!) as we’ll see later):

census_nc <- census_states |>
  filter(NAME == "North Carolina")

bbox_nc <- census_nc |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

ggplot() +
  geom_sf(data = census_nc, linewidth = 0.1, fill = clrs[6]) +
  geom_sf(data = census_coastline, aes(color = "US Census coastline data")) +
  annotate(
    geom = "rect",
    xmin = I(0.57), xmax = I(0.97), ymin = I(0.005), ymax = I(0.995),
    color = clrs[12], fill = NA, linetype = "21"
  ) +
  scale_color_manual(values = c(clrs[11])) +
  labs(title = "North Carolina", color = NULL) +
  coord_sf(
    xlim = bbox_nc[c(1, 3)], ylim = bbox_nc[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(legend.position = "top")

This mismatch in borders and coastlines matters a lot for calculations. We’re using st_intersection() to find where the two maps overlap. If the two maps are misaligned, we can’t identify the correct overlaps, which means we can’t identify the full coastal border. Remember how we calculated that only 50% of Hawaiʻi’s borders are coastlines? If we plot the coastal borders of Hawaiʻi, we can see why:

ggplot() +
  geom_sf(data = census_combined, linewidth = 0.2) +
  labs(title = "Incomplete coastline overlaps in Hawaiʻi") +
  coord_sf(
    xlim = bbox_hi[c(1, 3)], ylim = bbox_hi[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map()

There are a couple issues here. The dots scattered along the borders are a sign that there’s some misalignment between the two maps. The borders for the states sometimes cross the coastline borders in a single point instead of following the coast exactly. But the even bigger issue is that these coastlines look sketched out—there are so many major gaps that roughly 50% of the borders are missing. That’s again because of misalignment with the two maps. The borders don’t overlap exactly, so st_intersection() can’t pick them up.
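A toy example makes this failure mode concrete. This is only a minimal sketch with made-up geometry (a hypothetical 1 km square "state" and two hypothetical "coastline" lines, none of which come from the Census data), and it assumes {sf} is installed. A coastline that traces the border exactly is recovered in full, while one that is shifted by a single meter only crosses the border at isolated points:

```r
library(sf)

# Hypothetical 1 km x 1 km square "state" in planar coordinates (meters)
state <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
))))
state_border <- st_boundary(state)

# A hypothetical "coastline" that follows the bottom border exactly
coast_aligned <- st_sfc(st_linestring(rbind(c(0, 0), c(1000, 0))))

# The same coastline shifted 1 meter north and extended past the side borders
coast_shifted <- st_sfc(st_linestring(rbind(c(-10, 1), c(1010, 1))))

# Keep only the parts of the border that coincide with each coastline
overlap_aligned <- st_intersection(state_border, coast_aligned)
overlap_shifted <- st_intersection(state_border, coast_shifted)

# Points have zero length, so only true line overlaps contribute here
aligned_length <- sum(as.numeric(st_length(overlap_aligned)))
shifted_length <- sum(as.numeric(st_length(overlap_shifted)))
```

In this sketch the aligned coastline contributes its whole 1,000 meters, while the shifted one contributes nothing but a pair of crossing points, which is the same pattern as the dots and gaps in the Hawaiʻi plot.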

So what do we do? We could try finding different coastline data that matches the same resolution as the Census data. The National Oceanic and Atmospheric Administration (NOAA) and the US Geological Survey (USGS) each have their own shoreline datasets, and we could download those and load them into R and hope that they align with one of the Census’s state maps. But they don’t.

So we give up.

Use the lack of land as the coastline

Just kidding. There’s a better solution.

We already kind of have coastline data embedded in the state map data. For states that border an ocean or lake, anywhere the blue of the water touches the land is technically a coastline. Take Hawaiʻi, for instance, where the whole state border is the coastline:

ggplot() +
  geom_sf(data = census_hi, linewidth = 0.1, fill = clrs[5]) +
  labs(title = "The islands of Hawaiʻi") +
  coord_sf(crs = st_crs("EPSG:4269")) +  # NAD83
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

When looking at states with interior borders (like the other 49 states), though, the state map data has no way to distinguish which of those borders touch the ocean or touch other states. The east coast of North Carolina touches the ocean, but the northern, southern, and western borders do not. We need to somehow figure out which borders don’t touch other states.

ggplot() +
  geom_sf(data = census_nc, linewidth = 0.1, fill = clrs[6]) +
  annotate(geom = "text", x = I(0.6), y = I(0.89), label = "↑ Virginia ↑") +
  annotate(geom = "text", x = I(0.58), y = I(0.3), label = "↓ South Carolina ↓", angle = 313) +
  annotate(geom = "text", x = I(0.14), y = I(0.5), label = "↓ Georgia") +
  annotate(geom = "text", x = I(0.27), y = I(0.64), label = "← Tennessee") +
  labs(title = "The “island” of North Carolina") +
  coord_sf(crs = st_crs("EPSG:4269")) +  # NAD83
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

We do have data about Virginia, South Carolina, Tennessee, and Georgia, though, so there is a way to know that those borders aren’t ocean borders.

One way to determine which borders are interior and which ones are coastal is to create a big unified shape of the United States and then use the shape of North Carolina as a kind of cookie cutter. We can assume that any exposed edges are coasts.

First, we’ll make our big unified country shape:

census_us_giant <- census_states |>
  st_union() |>
  st_transform(crs = st_crs("ESRI:102003"))

ggplot() +
  geom_sf(data = census_us_giant, linewidth = 0, fill = clrs[1]) +
  labs(title = "One big US-shaped shape") +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

Before trying this with North Carolina, we’ll test the cookie cutter selection with Hawaiʻi, since we know that 100% of its borders are coastlines. If we use the Hawaiʻi shape to take a chunk out of the overall US shape, we get…

hi_ocean_border_census <- census_hi |>
  st_difference(census_us_giant)

ggplot() +
  geom_sf(data = hi_ocean_border_census) +
  labs(title = "lol nothing") +
  theme_map() +
  theme(
    panel.border = element_rect(color = "black", fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

…nothing.

The state cookie cutter was too perfect and selected right up to the edge of the overall country, leaving nothing.

To fix this, we can expand the state shape just a tiiiiiiny bit. We can use st_buffer() to add a tiny amount of distance all around the shape—since we’re using the Albers projection, we’re working in meters, so let’s add just 1 millimeter around the border before finding the difference:

hi_ocean_border_census <- census_hi |>
  st_buffer(dist = 0.001) |>
  st_difference(census_us_giant)

ggplot() +
  geom_sf(data = hi_ocean_border_census, linewidth = 0.2) +
  labs(title = "Hawaiʻi’s coastal borders extracted from US shape") +
  coord_sf(
    xlim = bbox_hi[c(1, 3)], ylim = bbox_hi[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map()

Perfect.

Well, almost perfect. We’ve identified the coastal borders, but adding the buffer actually distorts things when we calculate the length of the border.

Let’s find the coast-to-border proportion for Hawaiʻi, which should be 100%. First we’ll find the perimeter of the state:

hi_border_perimeter <- census_hi |> st_perimeter()

hi_border_perimeter
## 1413312 [m]

units::set_units(hi_border_perimeter, "miles")
## 878.2 [miles]

Hawaiʻi’s border is 1,413,312 meters, or 878 miles. Sounds reasonable.

Next we’ll find the perimeter of the coastline:

hi_ocean_border_perimeter <- hi_ocean_border_census |> st_perimeter()

hi_ocean_border_perimeter
## 2826623 [m]

units::set_units(hi_ocean_border_perimeter, "miles")
## 1756 [miles]

Hrm. Hawaiʻi’s coastline is 2,826,623 meters, or 1,756 miles. That’s actually exactly twice the correct border length:

as.numeric(hi_ocean_border_perimeter / hi_border_perimeter)
## [1] 2

This happened because of the 1 mm buffer that we added around the state shape. Adding the buffer transformed the line into a polygon—a super tiny 1 mm-narrow polygon, but a polygon nonetheless. That means the border technically has a top and sides and a bottom. When calculating the length or perimeter of the border, we’re double counting because we’re getting both the top and the bottom of the hyper-thin polygon.

To better illustrate what’s going on, let’s add a three kilometer buffer around the borders.

hi_ocean_border_census_huge_buffer <- census_hi |>
  st_buffer(dist = 3000) |>
  st_difference(census_us_giant)

ggplot() +
  geom_sf(
    data = hi_ocean_border_census_huge_buffer,
    linewidth = 0.2, fill = colorspace::lighten(clrs[5], 0.5)
  ) +
  labs(title = "Hawaiʻi’s borders with a 3 km buffer") +
  coord_sf(
    xlim = bbox_hi[c(1, 3)], ylim = bbox_hi[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map()

Calculating the perimeter of these borders will add the inside ring and the outside ring, effectively doubling the distance.
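The doubling can also be checked with plain arithmetic. Here is a rough sketch for a hypothetical square island with 1 km sides (a made-up shape, not real data). Buffering outward by a distance d keeps the four straight sides and adds four rounded corners that together form one full circle of radius d, so the outer ring is only 2πd longer than the original border. Note that st_buffer() approximates those rounded corners with short straight segments, so real output will differ very slightly:

```r
# Hypothetical square "island" with 1 km sides, buffered outward by 1 mm
side <- 1000      # meters
buffer <- 0.001   # meters

true_border <- 4 * side

# The thin ring left after the difference has two edges:
inner_ring <- 4 * side                     # the original border
outer_ring <- 4 * side + 2 * pi * buffer   # offset border; the four rounded corners sum to one full circle

ring_perimeter <- inner_ring + outer_ring

# How many times the true border gets counted
ratio <- ring_perimeter / true_border

# Halving the ring perimeter recovers the true border almost exactly
half_ratio <- (ring_perimeter / 2) / true_border
```

With a 1 mm buffer the extra length is a few millimeters on a 4 km border, so the ratio is indistinguishable from 2 and halving it is a safe correction.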

It’s less obvious that this doubling is happening when we add just 1 mm, but it is, and it’s leading to incorrect calculations. Fortunately it’s easy to adjust—we can halve the ocean border distance. Coastal borders comprise 100% of Hawaiʻi’s state borders, as expected:

as.numeric((hi_ocean_border_perimeter / 2) / hi_border_perimeter) |>
  scales::label_percent()()
## [1] "100%"

Let’s do the same thing with North Carolina:

nc_ocean_border_census <- census_nc |>
  st_buffer(dist = 0.001) |>
  st_difference(census_us_giant)

ggplot() +
  geom_sf(data = nc_ocean_border_census, linewidth = 0.2) +
  labs(title = "North Carolina’s coastal borders extracted from US shape") +
  coord_sf(crs = st_crs("EPSG:4269")) +
  theme_map()

Perfect! Those are all the state borders that don’t touch other states. That’s the coastline.

We can visualize this better by plotting the larger country shape, the North Carolina shape, and the coastal border:

ggplot() +
  geom_sf(data = census_us_giant, fill = clrs[1], alpha = 0.4) +
  geom_sf(data = census_nc, linewidth = 0.1, fill = clrs[6]) +
  geom_sf(
    data = nc_ocean_border_census, linewidth = 0.4,
    aes(color = "Coastal border extracted from US shape"),
    key_glyph = draw_key_path
  ) +
  annotate(geom = "text", x = I(0.6), y = I(0.89), label = "↑ Virginia ↑") +
  annotate(geom = "text", x = I(0.58), y = I(0.3), label = "↓ South Carolina ↓", angle = 313) +
  annotate(geom = "text", x = I(0.14), y = I(0.5), label = "↓ Georgia") +
  annotate(geom = "text", x = I(0.27), y = I(0.64), label = "← Tennessee") +
  scale_color_manual(values = c(clrs[8])) +
  labs(title = "North Carolina and its coastal border", color = NULL) +
  coord_sf(
    xlim = bbox_nc[c(1, 3)], ylim = bbox_nc[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0)),
    legend.key.height = unit(0.5, "lines"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.position = "top"
  )

And we can find the proportion of the state’s borders that are coastline:

nc_border_perimeter <- census_nc |> st_perimeter()

nc_border_perimeter
## 3498277 [m]

units::set_units(nc_border_perimeter, "miles")
## 2174 [miles]

nc_ocean_border_perimeter <- nc_ocean_border_census |> st_perimeter()

nc_ocean_border_perimeter / 2
## 1930852 [m]

units::set_units(nc_ocean_border_perimeter / 2, "miles")
## 1200 [miles]

as.numeric((nc_ocean_border_perimeter / 2) / nc_border_perimeter) |>
  scales::label_percent()()
## [1] "55%"

This approach is great. We don’t need to worry about making sure the coastline map matches the resolution of the state map, since we’re using just one map. Everything aligns perfectly automatically.
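The whole recipe can be checked on a toy map where the answer is known in advance. This is a minimal sketch, assuming {sf} is installed, with two hypothetical square "states" that share one border, so exactly three of state A's four sides are exposed. The helper names make_square() and outline_length() are made up for illustration. On real projected data st_perimeter() would be the natural choice; the sketch measures outlines with st_length() after casting only to keep the toy free of a CRS:

```r
library(sf)

# Hypothetical helper: a 1 km x 1 km square whose left edge sits at xmin
make_square <- function(xmin) {
  st_polygon(list(rbind(
    c(xmin, 0), c(xmin + 1000, 0), c(xmin + 1000, 1000),
    c(xmin, 1000), c(xmin, 0)
  )))
}

# Two made-up neighboring "states" that share one 1 km border
state_a <- st_sfc(make_square(0))
state_b <- st_sfc(make_square(1000))
country <- st_union(c(state_a, state_b))

# Cookie cutter: buffer state A by 1 mm and keep whatever sticks out of the country
exposed_ring <- st_difference(st_buffer(state_a, dist = 0.001), country)

# Hypothetical helper: total outline length of polygons, measured as lines
outline_length <- function(x) {
  sum(as.numeric(st_length(st_cast(x, "MULTILINESTRING"))))
}

# Halve the ring's outline to undo the double counting from the buffer
prop_exposed <- (outline_length(exposed_ring) / 2) / outline_length(state_a)
```

Three of four equal sides are exposed, so prop_exposed should land very close to 0.75, with a tiny excess from the rounded corners and end caps of the 1 mm ring.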

BUT there’s one more wrinkle to worry about. This approach gets trickier with states that touch other countries, like Washington, where the western border touches the ocean, the northern border touches Canada, and the eastern and southern borders touch other states. The state data knows about Idaho and Oregon, but it doesn’t know that Canada is there. Washington looks like it has an exterior, ocean-facing northern border: Canada is replaced with an ocean.

census_wa <- census_states |>
  filter(NAME == "Washington")

bbox_wa <- census_wa |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

wa_ocean_border_census <- census_wa |>
  st_buffer(dist = 0.001) |>
  st_difference(census_us_giant)

ggplot() +
  geom_sf(data = census_us_giant, fill = clrs[1], alpha = 0.4) +
  geom_sf(data = census_wa, linewidth = 0.1, fill = clrs[3]) +
  geom_sf(
    data = wa_ocean_border_census, linewidth = 0.4,
    aes(color = "Coastal border extracted from US shape"),
    key_glyph = draw_key_path
  ) +
  annotate(geom = "text", x = I(0.64), y = I(0.91), label = "↑ Canada (British Columbia) ↑") +
  annotate(geom = "text", x = I(0.54), y = I(0.18), label = "↓ Oregon ↓") +
  annotate(geom = "text", x = I(0.87), y = I(0.5), label = "Idaho →") +
  scale_color_manual(values = c(clrs[8])) +
  labs(title = "Washington and the fake Canadian ocean", color = NULL) +
  coord_sf(
    xlim = bbox_wa[c(1, 3)], ylim = bbox_wa[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0)),
    legend.key.height = unit(0.5, "lines"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.position = "top"
  )

The big US-shaped polygon isn’t big enough for identifying land borders in other countries. In order to identify all the ocean-facing state borders, we need data about Canada and Mexico, and the US Census doesn’t provide that.

So we give up.

Use data from Natural Earth

Just kidding. There’s a better solution.

The Census doesn’t have data on other countries, but other data sources do! The incredible Natural Earth project has a ton of shapefiles for the entire world, both physical (i.e. rivers, coastlines, etc.) and cultural, like borders for countries (what it calls Admin-0) and states/provinces (what it calls Admin-1). It offers three levels of resolution: 1:10 million (high resolution), 1:50 million (medium resolution), and 1:110 million (low resolution).

Plus, like {tigris}, the {rnaturalearth} package makes it really easy to load that data directly into R.

Let’s follow the same process for Hawaiʻi, North Carolina, and Washington using Natural Earth data instead of Census data.

First, we’ll grab high resolution (1:10 million) maps from the Admin-1 (states and provinces) data:

if (!file.exists("ne_data/ne_10m_admin_1_states_provinces_lakes.shp")) {
  ne_download(
    type = "admin_1_states_provinces_lakes",
    scale = 10,
    destdir = "ne_data",
    load = FALSE
  )
}

states_provinces <- ne_load(
  type = "admin_1_states_provinces_lakes",
  scale = 10,
  destdir = "ne_data"
)

We only need to work with three countries, not the entire world, so we’ll filter it down to just North America before joining everything into a single shape with st_union():

na_giant <- states_provinces |>
  filter(admin %in% c("United States of America", "Canada", "Mexico")) |>
  st_union() |>
  st_transform(crs = st_crs("ESRI:102003"))  # Convert to Albers for meters

ggplot() +
  geom_sf(data = na_giant, linewidth = 0, fill = clrs[10]) +
  labs(title = "One big North America-shaped shape") +
  coord_sf(crs = st_crs("ESRI:102003")) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

(That’s so pretty.)

Next, we’ll extract all the US states from the Natural Earth data, since we want this big North American shape to match the states exactly:

ne_states <- states_provinces |>
  filter(admin == "United States of America") |>
  st_transform(crs = st_crs("ESRI:102003"))

ggplot() +
  geom_sf(data = na_giant, linewidth = 0, fill = clrs[10]) +
  geom_sf(data = ne_states, linewidth = 0.05, fill = clrs[8], color = "white") +
  labs(title = "The United States overlaid on North America") +
  coord_sf(crs = st_crs("ESRI:102003")) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0))
  )

Now we’ll go through the same process as before, using state shapes as cookie cutters from the larger North American shape and extracting non-interior borders.

Hawaiʻi

Here are Hawaiʻi’s extracted borders—because we’re working with high resolution data, we get a lot more detail, including all the Northwestern Hawaiian Islands:

ne_hi <- ne_states |>
  filter(name == "Hawaii")

bbox_ne_hi <- ne_hi |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

hi_ocean_border_ne <- ne_hi |>
  st_buffer(dist = 0.001) |>
  st_difference(na_giant)

ggplot() +
  geom_sf(data = hi_ocean_border_ne, linewidth = 0.2) +
  labs(title = "Hawaiʻi’s coastal borders extracted from North America shape") +
  coord_sf(
    xlim = bbox_ne_hi[c(1, 3)], ylim = bbox_ne_hi[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map()

As before, we can calculate the perimeter of the state, the perimeter of the coastline, and find the proportion. It’s 100%, as expected:

# State border
hi_border_perimeter_ne <- ne_hi |> st_perimeter()

hi_border_perimeter_ne
## 1458906 [m]

units::set_units(hi_border_perimeter_ne, "miles")
## 906.5 [miles]

# Coastal border
hi_ocean_border_perimeter_ne <- hi_ocean_border_ne |> st_perimeter()

hi_ocean_border_perimeter_ne / 2
## 1458906 [m]

units::set_units(hi_ocean_border_perimeter_ne / 2, "miles")
## 906.5 [miles]

# Proportion
as.numeric((hi_ocean_border_perimeter_ne / 2) / hi_border_perimeter_ne) |>
  scales::label_percent()()
## [1] "100%"

North Carolina

The same process works for North Carolina:

ne_nc <- ne_states |>
  filter(name == "North Carolina")

bbox_ne_nc <- ne_nc |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

nc_ocean_border_ne <- ne_nc |>
  st_buffer(dist = 0.001) |>
  st_difference(na_giant)

ggplot() +
  geom_sf(data = nc_ocean_border_ne, linewidth = 0.2) +
  labs(title = "North Carolina’s coastal borders\nextracted from North America shape") +
  coord_sf(crs = st_crs("EPSG:4269")) +
  theme_map()

Here’s that extracted coastline with the rest of the state:

ggplot() +
  geom_sf(data = na_giant, fill = clrs[10], alpha = 0.4) +
  geom_sf(data = ne_nc, linewidth = 0.1, fill = clrs[6]) +
  geom_sf(
    data = nc_ocean_border_ne, linewidth = 0.4,
    aes(color = "Coastal border extracted from North America shape"),
    key_glyph = draw_key_path
  ) +
  annotate(geom = "text", x = I(0.6), y = I(0.89), label = "↑ Virginia ↑") +
  annotate(geom = "text", x = I(0.58), y = I(0.3), label = "↓ South Carolina ↓", angle = 313) +
  annotate(geom = "text", x = I(0.14), y = I(0.5), label = "↓ Georgia") +
  annotate(geom = "text", x = I(0.27), y = I(0.64), label = "← Tennessee") +
  scale_color_manual(values = c(clrs[8])) +
  labs(title = "The actual North Carolina coastline", color = NULL) +
  coord_sf(
    xlim = bbox_ne_nc[c(1, 3)], ylim = bbox_ne_nc[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0)),
    legend.key.height = unit(0.5, "lines"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.position = "top"
  )

Now we can calculate proportion of coastline:

# State border
nc_border_perimeter_ne <- ne_nc |> st_perimeter()

nc_border_perimeter_ne
## 4095093 [m]

units::set_units(nc_border_perimeter_ne, "miles")
## 2545 [miles]

# Coastal border
nc_ocean_border_perimeter_ne <- nc_ocean_border_ne |> st_perimeter()

nc_ocean_border_perimeter_ne / 2
## 2596482 [m]

units::set_units(nc_ocean_border_perimeter_ne / 2, "miles")
## 1613 [miles]

# Proportion
as.numeric((nc_ocean_border_perimeter_ne / 2) / nc_border_perimeter_ne) |>
  scales::label_percent()()
## [1] "63%"

Sharp-eyed readers will notice something odd, though! Using Census maps at 1:5 million resolution, we found that 55% of North Carolina’s borders were coastline. Using Natural Earth maps at 1:10 million resolution, it becomes 63%. I’ll talk more about that discrepancy later in this post (spoiler: it’s because of the new-to-me Coastline Paradox). The detail in the two maps is different, so the amount of landmass visible along the coastline is different, yielding different perimeters and distances.

Washington

When we were using Census data, we couldn’t use this process with Washington because Canada was invisible and was being treated as an ocean. Now that we have data from Natural Earth, Canada can be accounted for and we’ll get the correct ocean-facing border.

ne_wa <- ne_states |>
  filter(name == "Washington")

bbox_ne_wa <- ne_wa |>
  st_transform(st_crs("EPSG:4269")) |>
  st_bbox()

wa_ocean_border_ne <- ne_wa |>
  st_buffer(dist = 0.001) |>
  st_difference(na_giant)

ggplot() +
  geom_sf(data = wa_ocean_border_ne, linewidth = 0.2) +
  labs(title = "Washington’s coastal borders\nextracted from North America shape") +
  coord_sf(crs = st_crs("EPSG:4269")) +
  theme_map()

Here’s that coastline in context:

ggplot() +
  geom_sf(data = na_giant, fill = clrs[10], alpha = 0.4) +
  geom_sf(data = ne_wa, linewidth = 0.1, fill = clrs[3]) +
  geom_sf(
    data = wa_ocean_border_ne, linewidth = 0.4,
    aes(color = "Coastal border extracted from North America shape"),
    key_glyph = draw_key_path
  ) +
  annotate(geom = "text", x = I(0.64), y = I(0.91), label = "↑ Canada (British Columbia) ↑") +
  annotate(geom = "text", x = I(0.54), y = I(0.18), label = "↓ Oregon ↓") +
  annotate(geom = "text", x = I(0.87), y = I(0.5), label = "Idaho →") +
  scale_color_manual(values = c(clrs[8])) +
  labs(title = "The actual Washington coastline", color = NULL) +
  coord_sf(
    xlim = bbox_ne_wa[c(1, 3)], ylim = bbox_ne_wa[c(2, 4)],
    crs = st_crs("EPSG:4269")
  ) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0)),
    legend.key.height = unit(0.5, "lines"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.position = "top"
  )

Since the northern state border is no longer seen as a coastal border, we can calculate the correct proportion of coastline:

# State border
wa_border_perimeter_ne <- ne_wa |> st_perimeter()

wa_border_perimeter_ne
## 3913863 [m]

units::set_units(wa_border_perimeter_ne, "miles")
## 2432 [miles]

# Coastal border
wa_ocean_border_perimeter_ne <- wa_ocean_border_ne |> st_perimeter()

wa_ocean_border_perimeter_ne / 2
## 2591575 [m]

units::set_units(wa_ocean_border_perimeter_ne / 2, "miles")
## 1610 [miles]

# Proportion
as.numeric((wa_ocean_border_perimeter_ne / 2) / wa_border_perimeter_ne) |>
  scales::label_percent()()
## [1] "66%"

All states

Now that we know that the Natural Earth approach works, let’s apply it to all the states. We’ll nest the geographic data into a list column with a cell for each state, then extract the borders for each state:

ne_coastline <- ne_states |>
  group_by(name) |>
  nest() |>
  mutate(ocean_only = map(data, ~ {
    .x |>
      st_buffer(dist = 0.001) |>
      st_difference(na_giant)
  })) |>
  unnest(ocean_only) |>
  ungroup() |>
  # This special column somehow lost its specialness
  st_set_geometry("geometry")

ggplot() +
  geom_sf(data = na_giant, linewidth = 0, fill = clrs[10], alpha = 0.5) +
  geom_sf(
    data = ne_coastline, linewidth = 0.2,
    aes(color = "Coastal border extracted from North America shape"),
    key_glyph = draw_key_path
  ) +
  scale_color_manual(values = c(clrs[8])) +
  labs(title = "All US coastal borders", color = NULL) +
  coord_sf(crs = st_crs("ESRI:102003")) +
  theme_map() +
  theme(
    panel.background = element_rect(fill = clr_ocean),
    plot.title = element_text(margin = margin(6.5, 0, 6.5, 0)),
    legend.key.height = unit(0.5, "lines"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.position = "top"
  )

That looks great!

Finally, let’s put this in a table. We’ll calculate the perimeter of all the states, then join that data to the coastal data, and then calculate the proportion of coastline.

ne_state_border_lengths <- ne_states |>
  mutate(border_length = st_perimeter(geometry)) |>
  st_drop_geometry() |>
  select(name, border_length)

coastline_length_by_state_ne <- ne_coastline |>
  mutate(coastline_length = st_perimeter(geometry) / 2) |>
  st_drop_geometry() |>
  left_join(ne_state_border_lengths, by = join_by(name)) |>
  mutate(prop_coastline = as.numeric(coastline_length / border_length)) |>
  mutate(rank = rank(-prop_coastline)) |>
  mutate(
    across(c(border_length, coastline_length), list(miles = ~units::set_units(., "miles")))
  ) |>
  select(
    name, rank, prop_coastline,
    starts_with("border_length"), starts_with("coastline_length")
  )

This is fantastic! Hawaiʻi has the highest proportion of coastline, for obvious reasons, and an astounding 95% of Alaska’s borders touch the ocean, even though a huge chunk of the state shares a land border with Canada—there are just so many islands and inlets.

In the contiguous United States, island-y states like Florida, Michigan, and Louisiana have the highest proportion of coastline. Despite their small size, coastal Northeastern states like New Jersey, Massachusetts, and Rhode Island have a lot of coastline relative to the rest of their borders. Check out Rhode Island at #10: my wife’s offhand observation was pretty accurate! (These columns are sortable.)
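The arithmetic behind the table (halve the ring perimeter, divide by the state perimeter, rank in descending order) can be sanity-checked without any geographic data. Here is a small sketch in base R that uses only the three Natural Earth perimeters computed earlier in this post. The toy data frame and its ring_perimeter column are hypothetical names, and the ring perimeters are simply the reported coastline lengths doubled to mimic what st_perimeter() returns for the buffered ring:

```r
# Perimeters (in meters) reported above for the three worked examples
toy <- data.frame(
  name = c("Hawaii", "North Carolina", "Washington"),
  border_length = c(1458906, 4095093, 3913863),
  # Ring perimeters before halving: twice the reported coastline lengths
  ring_perimeter = 2 * c(1458906, 2596482, 2591575)
)

# Undo the double counting, then find the proportion and rank it
toy$coastline_length <- toy$ring_perimeter / 2
toy$prop_coastline <- toy$coastline_length / toy$border_length
toy$rank <- rank(-toy$prop_coastline)

toy[order(toy$rank), c("name", "rank", "prop_coastline")]
```

Among just these three, Hawaiʻi ranks first, followed by Washington and then North Carolina, matching the 100%, 66%, and 63% figures from the individual sections.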


Calculating birthday probabilities with R instead of math

Andrew Heiss — Fri, 03 May 2024 04:00:00 GMT

Even though I’ve been teaching R and statistical programming since 2017, and despite the fact that I do all sorts of heavily quantitative research, I’m really really bad at probability math.

Like super bad.

The last time I truly had to do set theory and probability math was in my first PhD-level stats class in 2012. The professor cancelled classes after the first month and gave us all of October to re-teach ourselves calculus and probability theory (thank you Sal Khan), and then the rest of the class was pretty much all about pure set theory stuff. It was… not fun.

But I learned a valuable secret power from the class. During the final couple weeks of the course, the professor mentioned in passing that it’s possible to skip most of this probability math and instead use simulations to get the same answers. That one throwaway comment changed my whole approach to doing anything based on probabilities.

Why simulate?

In one problem set from November 2012¹, we had to answer this question using both actual probability math and R simulation:

¹ I wrote this in a .Rnw file! R Markdown wasn’t even a thing yet!

An urn contains 10 red balls, 10 blue balls, and 20 green balls. If 5 balls are selected at random without replacement, what is the probability that at least 1 ball of each color will be selected?

ew probability math

We can find this probability by finding the probability of not selecting one or more of the colors in the draw and subtracting it from 1. We need to find the probability of selecting no red balls, no blue balls, and no green balls, and then subtract the probability of the overlapping situations (i.e. no red or blue balls, no red or green balls, and no blue or green balls).

To do this, we can use n-choose-k notation from combinatorics to represent the number of choices from a pool of possible combinations. This notation looks like this:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

If we’re selecting 5 balls from a pool of 40, we can say “40 choose 5”, or $\binom{40}{5}$. To calculate that, we get this gross mess:

$$\binom{40}{5} = \frac{40!}{5!\,(40-5)!} = \frac{40!}{5!\,35!}$$

Or we can do it with R:

choose(40, 5)
## [1] 658008
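As a quick cross-check, the factorial definition of n-choose-k, n! / (k! (n − k)!), can be computed by hand in R. In this sketch the 35! in the denominator is cancelled against 40! first, leaving only 40 × 39 × 38 × 37 × 36 on top, which avoids the enormous intermediate numbers that full factorials would create:

```r
# 40! / 35! reduces to the product of the integers 36 through 40
numerator <- prod(36:40)

# Divide by 5! to remove the orderings of the 5 selected balls
by_hand <- numerator / factorial(5)

by_hand == choose(40, 5)
```

Both routes give the same count, so choose() can be trusted as shorthand for the formula.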

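So with this binomial choose notation, we can calculate the official formal probability of drawing at least one red, blue, and green ball from this urn:

$$P(\text{at least one of each color}) = 1 - \frac{\binom{30}{5} + \binom{30}{5} + \binom{20}{5} - \binom{20}{5} - \binom{10}{5} - \binom{10}{5}}{\binom{40}{5}}$$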

If we really really wanted, we could then calculate all of that by hand, but ew.

We can just use R instead:

# Ways to draw 5 balls without getting a specific color
no_red <- choose(30, 5)
no_blue <- choose(30, 5)
no_green <- choose(20, 5)

# Ways to draw 5 balls without getting two specific colors
no_red_blue <- choose(20, 5)
no_red_green <- choose(10, 5)
no_blue_green <- choose(10, 5)

# Ways to draw 5 balls in general
total_ways <- choose(40, 5)

# Probability of drawing at least 1 of each color
prob_real <- 1 - (no_red + no_blue + no_green - no_red_blue - no_red_green - no_blue_green) / total_ways
prob_real
## [1] 0.5676

Great. There’s a 56.76% chance of drawing at least one of each color. We have an answer, but this was really hard, and I could only do it because I dug up my old problem sets from 2012.
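One way to double-check that answer without inclusion-exclusion is to count the favorable draws directly. This is a sketch of an alternative route, not the method from the original problem set: list every way to split 5 balls into at least one red, one blue, and one green, count the ways to pick each split from the urn, and add them up. The object names are made up for illustration:

```r
# Every (red, blue, green) count with at least 1 of each; with 5 balls no color can exceed 3
splits <- expand.grid(red = 1:3, blue = 1:3, green = 1:3)
splits <- splits[rowSums(splits) == 5, ]

# Ways to pick each split from 10 red, 10 blue, and 20 green balls
splits$ways <- choose(10, splits$red) * choose(10, splits$blue) * choose(20, splits$green)

# Favorable draws divided by all possible draws
favorable_ways <- sum(splits$ways)
prob_direct <- favorable_ways / choose(40, 5)
round(prob_direct, 4)
## [1] 0.5676
```

There are only six valid splits, and their counts add up to 373,500 of the 658,008 possible draws, which agrees with the inclusion-exclusion result.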

yay brute force simulation

I really don’t like formal probability math. Fortunately there’s a way I find a heck of a lot easier to use. Brute force simulation.

Instead of figuring out all these weird n-choose-k probabilities, we’ll use the power of computers to literally draw from a hypothetical urn over and over and over again until we come to the right answer.

Here’s one way to do it:²

² Again, this is 2012-era R code; nowadays I’d forgo the loop and use something like purrr::map() or sapply().

# Make this randomness consistent
set.seed(12345)

# Make an urn with balls in it
urn <- c(rep('red', 10), rep('blue', 10), rep('green', 20))

# How many times we'll draw from the urn
simulations <- 100000

count <- 0
for (i in 1:simulations) {
  # Pick 5 balls from the urn
  draw <- sample(urn, 5)

  # See if there's a red, blue, and green; if so, record it
  if ('red' %in% draw && 'blue' %in% draw && 'green' %in% draw) {
    count <- count + 1
  }
}

# Find the simulated probability
prob_simulated <- count / simulations
prob_simulated
## [1] 0.5681

Sweet. The simulation spat out 0.5681, which is shockingly close to 0.5676. If we boosted the number of simulations from 100,000 to something even higher,³ we’d eventually converge on the true answer.

³ Going up to 2,000,000 got me to 0.5676.
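For comparison, here is roughly what a loop-free version could look like. It is a sketch using base R's replicate() as one possible stand-in for the purrr::map() or sapply() approaches mentioned in the footnote, and the name has_all_colors is made up for illustration. Each replication draws 5 balls and records TRUE or FALSE, and the mean of those logical values is the simulated probability:

```r
# Make this randomness consistent
set.seed(12345)

# Same urn as before: 10 red, 10 blue, 20 green
urn <- rep(c("red", "blue", "green"), times = c(10, 10, 20))

simulations <- 100000

# One TRUE/FALSE per simulated draw: did we get all three colors?
has_all_colors <- replicate(simulations, {
  draw <- sample(urn, 5)
  all(c("red", "blue", "green") %in% draw)
})

# The proportion of TRUEs is the simulated probability
prob_simulated_vectorized <- mean(has_all_colors)
prob_simulated_vectorized
```

The result should land within simulation noise of the exact 0.5676, just like the loop.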

I use this simulation-based approach to anything mathy as much as I can. Personally, I find it far more intuitive to re-create the data generating process rather than think in set theory and combinatorics. In my program evaluation class, we do an in-class activity with the dice game Zilch where we figure out the probability of scoring something in a given dice roll. Instead of finding real probabilities, we just simulate thousands of dice rolls and mark if something was rolled. We essentially recreate the exact data generating process.

This approach is also the core of modern Bayesian statistics. Calculating complex integrals to find posterior distributions is too hard, so we can use Markov Chain Monte Carlo (MCMC) processes that bounce around the plausible space for a posterior distribution until they settle on a stable value.

Birthday probabilities

A couple days ago, I came across this post on Bluesky:

Post by Karl Rohe (@karlrohe.bsky.social)

This is neat because it’s also the case in my household. We have “birthday season” from May to November, and have a dearth of birthdays from November to May. They’re all clustered in half the year. I’d never thought about how unlikely that was.

There’s probably some formal probability math that can answer Karl’s question precisely, but that’s hard. So instead, I decided to figure this out with my old friend—brute force simulation.

The data generating process is a little more complicated than just drawing balls from urns, and the cyclical nature of calendars adds an extra wrinkle to simulating everything, but it’s doable (and fun!), so I figured I’d share the details of the simulation process here. And with the recent release of {ggplot2} 3.5 and its new coord_radial() and legend placement settings and point-based text sizing and absolute plot-based positioning, I figured I’d make some pretty plots along the way.

Let’s load some libraries, make a custom theme, and get started!

library(tidyverse)
library(ggtext)
library(patchwork)

clrs <- MetBrewer::met.brewer("Demuth")

# Custom ggplot theme to make pretty plots
# Get the font at https://fonts.google.com/specimen/Montserrat
theme_calendar <- function() {
  theme_minimal(base_family = "Montserrat") +
    theme(
      axis.text.y = element_blank(),
      axis.title = element_blank(),
      panel.grid.minor = element_blank(),
      plot.title = element_text(face = "bold", hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5)
    )
}

update_geom_defaults("text", list(family = "Montserrat"))

Visualizing birthday distributions and spans

All birthdays within 6-month span

First, let’s work with a hypothetical household with four people in it with birthdays on January 4, March 10, April 28, and May 21. We’ll plot these on a radial plot and add a 6-month span starting at the first birthday:

# All these happen within a 6-month span
birthdays_yes <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-05-21")
)

tibble(x = birthdays_yes) |>
  ggplot(aes(x = x, y = "")) +
  annotate(
    geom = "segment",
    x = birthdays_yes[1], xend = birthdays_yes[1] + months(6), y = "",
    linewidth = 3, color = clrs[5]) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Yep", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()

The four birthdays all fit comfortably within the 6-month span. Neat.

All birthdays within 6-month span, but tricky

Next, let’s change the May birthday to December 1. These four birthdays still all fit within a 6-month span, but it’s trickier to see because the calendar year resets in the middle. Earlier, we plotted the yellow span with annotate(), but if we do that now, it breaks and we get a warning. The end of the span (June 1, 2025) falls outside the 2024 limits of the x-axis, so we can’t draw a single line segment from December 1 to six months later:

birthdays_yes_but_tricky <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-12-01")
)

tibble(x = birthdays_yes_but_tricky) |>
  ggplot(aes(x = x, y = "")) +
  annotate(
    geom = "segment",
    x = birthdays_yes_but_tricky[4],
    xend = birthdays_yes_but_tricky[4] + months(6),
    y = "",
    linewidth = 3, color = clrs[5]) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Yep\n(but broken)", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()
## Warning: Removed 1 row containing missing values or values outside the scale range (`geom_segment()`).

Instead, we can draw two line segments—one from December 1 to December 31, and one from January 1 to whatever six months from December 1 is. Since this plot represents all of 2024, we’ll force the continued time after January 1 to also be in 2024 (even though it’s technically 2025). Here I colored the segments a little differently to highlight the fact that they’re two separate lines:

tibble(x = birthdays_yes_but_tricky) |>
  ggplot(aes(x = x, y = "")) +
  annotate(
    geom = "segment",
    x = birthdays_yes_but_tricky[4],
    xend = ymd("2024-12-31"),
    y = "",
    linewidth = 3, color = clrs[5]) +
  annotate(
    geom = "segment",
    x = ymd("2024-01-01"),
    xend = (ymd("2024-01-01") + months(6)) - (ymd("2024-12-31") - birthdays_yes_but_tricky[4]),
    y = "",
    linewidth = 3, color = clrs[4]) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Yep\n(but tricky)", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()

Writing two separate annotate() layers feels repetitive, though, and it’s easy to make mistakes. So instead, we can make a little helper function that will create a data frame with the start and end date of a six-month span. If the span crosses December 31, it returns two spans; if not, it returns one span:

calc_date_arc <- function(start_date) {
  days_till_end <- ymd("2024-12-31") - start_date

  if (days_till_end >= months(6)) {
    x <- start_date
    xend <- start_date + months(6)
  } else {
    x <- c(start_date, ymd("2024-01-01"))
    xend <- c(
      start_date + days_till_end,
      (ymd("2024-01-01") + months(6)) - days_till_end
    )
  }

  return(tibble(x = x, xend = xend))
}

Let’s make sure it works. Six months from March 15 is September 15, which doesn’t cross into a new year, so we get just one start and end date:

calc_date_arc(ymd("2024-03-15"))
## # A tibble: 1 × 2
##   x          xend
##   <date>     <date>
## 1 2024-03-15 2024-09-15

Six months from November 15 is sometime in May, which means we do cross into a new year. We thus get two spans: (1) a segment from November to the end of December, and (2) a segment from January to May:

calc_date_arc(ymd("2024-11-15"))
## # A tibble: 2 × 2
##   x          xend
##   <date>     <date>
## 1 2024-11-15 2024-12-31
## 2 2024-01-01 2024-05-16

Plotting with this function is a lot easier, since it returns a data frame. We don’t need to worry about using annotate() anymore and can instead map the x and xend aesthetics from the data to the plot:

tibble(x = birthdays_yes_but_tricky) |>
  ggplot(aes(x = x, y = "")) +
  geom_segment(
    data = calc_date_arc(birthdays_yes_but_tricky[4]),
    aes(xend = xend),
    linewidth = 3, color = clrs[5]
  ) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Yep\n(but tricky)", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()

Some birthdays outside a 6-month span

Let’s change the set of birthdays again so that one of them falls outside the six-month window. Regardless of where we start the span, we can’t collect all the points within a continuous six-month period:

birthdays_no <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-09-21")
)

p1 <- tibble(x = birthdays_no) |>
  ggplot(aes(x = x, y = "")) +
  geom_segment(
    data = calc_date_arc(birthdays_no[1]),
    aes(xend = xend),
    linewidth = 3, color = clrs[3]
  ) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Nope", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()

p2 <- tibble(x = birthdays_no) |>
  ggplot(aes(x = x, y = "")) +
  geom_segment(
    data = calc_date_arc(birthdays_no[4]),
    aes(xend = xend),
    linewidth = 3, color = clrs[3]
  ) +
  geom_point(size = 5, fill = clrs[10], color = "white", pch = 21) +
  annotate(
    "text",
    label = "Still nope", fontface = "bold",
    x = I(0.5), y = I(0),
    size = 14, size.unit = "pt"
  ) +
  scale_x_date(
    date_breaks = "1 month", date_labels = "%B",
    limits = c(ymd("2024-01-01"), ymd("2024-12-31")),
    expand = expansion(0, 0)
  ) +
  scale_y_discrete(expand = expansion(add = c(0, 1))) +
  coord_radial(inner.radius = 0.8) +
  theme_calendar()

(p1 | plot_spacer() | p2) +
  plot_layout(widths = c(0.45, 0.1, 0.45))

Simulation time!

So far, we’ve been working with a hypothetical household of 4, with arbitrarily chosen birthdays. For the simulation, we’ll need to work with randomly selected birthdays for households of varying sizes.

Counting simulated birthdays and measuring spans

But before we build the full simulation, we need a way to programmatically detect whether a set of dates fits within a six-month span, which (as we saw with the plotting) is surprisingly tricky because of the possible change in year.

If we didn’t need to contend with a change in year, we could convert all the birthdays to their corresponding day of the year, sort them, find the difference between the first and the last, and see if it’s less than 183 days (366/2; we’re working with a leap year). This would work great:

birthdays_yes <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-05-21")
)

birthdays_sorted <- sort(yday(birthdays_yes))
birthdays_sorted
## [1]   4  70 119 142

max(birthdays_sorted) - min(birthdays_sorted)
## [1] 138

But time is a circle. If we look at a set of birthdays that crosses a new year, we can’t just look at max - min. Also, there’s no guarantee that the six-month span will go from the first to the last; it could go from the last to the first.

birthdays_yes_but_tricky <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-12-01")
)

birthdays_sorted <- sort(yday(birthdays_yes_but_tricky))
birthdays_sorted
## [1]   4  70 119 336

max(birthdays_sorted) - min(birthdays_sorted)
## [1] 332

So we need to do something a little trickier to account for the year looping over. There are several potential ways to do this, and there’s no one right way. Here’s the approach I settled on.

We need to check the span between each possible window of dates. In a household of 4, this means finding the distance between the 4th (or last) date and the 1st date, which we did with max - min. But we also need to check the distance between the 1st date in the next cycle (i.e. 2025) and the 2nd date (in 2024), and so on, like this:

Distance between 4th and 1st:

ymd("2024-12-01") - ymd("2024-01-04") or 332 days

Distance between (1st + 1 year) and 2nd:

ymd("2025-01-04") - ymd("2024-03-10") or 300 days

Distance between (2nd + 1 year) and 3rd:

ymd("2025-03-10") - ymd("2024-04-28") or 316 days

Distance between (3rd + 1 year) and 4th:

ymd("2025-04-28") - ymd("2024-12-01") or 148 days

That last one, from December 1 to April 28, is less than 183 days (half of a 366-day year), which means that the dates fit within a six-month span. We saw this in the plot earlier too: if we start the span in December, it more than covers the remaining birthdays.
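To double-check the four distances listed above, here’s a small sketch that computes them directly with {lubridate}. The names birthdays, next_year, window_end, and spans are just illustrative and don’t appear elsewhere in this post:

```r
library(lubridate)

birthdays <- ymd(c("2024-01-04", "2024-03-10", "2024-04-28", "2024-12-01"))

# The same birthdays, one calendar year later
next_year <- birthdays + years(1)

# Each window starts at one birthday and ends at the birthday just before it:
# the 4th date for the 1st, then next year's 1st, 2nd, and 3rd dates
window_end <- c(birthdays[4], next_year[1:3])

spans <- as.numeric(window_end - birthdays)
spans
## [1] 332 300 316 148

# Does any window fit within half of a 366-day year?
any(spans <= 183)
## [1] TRUE
```

Only the last window (starting December 1) is short enough, which matches what the radial plot showed.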

One easy way to look at dates in the next year is to double up the vector of birthday days-of-the-year, adding 366 to the first set, like this:

birthdays_doubled <- c(
  sort(yday(birthdays_yes_but_tricky)),
  sort(yday(birthdays_yes_but_tricky)) + 366
)

birthdays_doubled
## [1]   4  70 119 336 370 436 485 702

The first four represent the regular real birthdays; the next four are the same values, just shifted up a year (so 4 and 370 are both January 4, etc.).

With this vector, we can now find differences between dates that cross years more easily:⁴

⁴ Some of these differences aren’t the same as before (317 instead of 316; 149 instead of 148). This is because 2024 is a leap year and 2025 is not, and {lubridate} accounts for that. By adding 366, we’re pretending 2025 is also a leap year. But that’s okay, because we want to pretend that February 29 happens each year.

birthdays_doubled[4] - birthdays_doubled[1]
## [1] 332

birthdays_doubled[5] - birthdays_doubled[2]
## [1] 300

birthdays_doubled[6] - birthdays_doubled[3]
## [1] 317

birthdays_doubled[7] - birthdays_doubled[4]
## [1] 149
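The one-day gap mentioned in the footnote is easy to confirm. This sketch compares real date subtraction against the add-366 shortcut for the third window (March 10 of the following year back to April 28):

```r
library(lubridate)

# Real calendar arithmetic: 2025 has no February 29
real_gap <- as.numeric(ymd("2025-03-10") - ymd("2024-04-28"))
real_gap
## [1] 316

# Day-of-year arithmetic that pretends the next year is also a leap year
shortcut_gap <- (yday(ymd("2024-03-10")) + 366) - yday(ymd("2024-04-28"))
shortcut_gap
## [1] 317
```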

Or instead of repeating lots of lines like that, we can auto-increment the different indices:

n <- length(birthdays_yes_but_tricky)
map_dbl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x])
## [1] 332 300 317 149

If any of those values are 183 or less, the birthdays fit in a six-month span:

any(map_lgl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x] <= 183))
## [1] TRUE

Let’s check it with the other two test sets of birthdays. Here’s the easy set of birthdays without any cross-year loops:

birthdays_yes <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-05-21")
)

birthdays_doubled <- c(
  sort(yday(birthdays_yes)),
  sort(yday(birthdays_yes)) + 366
)

map_dbl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x])
## [1] 138 300 317 343

any(map_lgl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x] <= 183))
## [1] TRUE

And here’s the set we know doesn’t fit:

birthdays_no <- c(
  ymd("2024-01-04"),
  ymd("2024-03-10"),
  ymd("2024-04-28"),
  ymd("2024-09-21")
)

birthdays_doubled <- c(
  sort(yday(birthdays_no)),
  sort(yday(birthdays_no)) + 366
)

map_dbl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x])
## [1] 261 300 317 220

any(map_lgl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x] <= 183))
## [1] FALSE

It works!
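Since we just ran the same few lines three times, one option is to wrap the check in a function. This is only a sketch: fits_in_span() and its year_length and span arguments are hypothetical names that aren’t used elsewhere in this post, and it uses base R indexing instead of map_lgl() so that it only needs {lubridate}:

```r
library(lubridate)

# Hypothetical helper: TRUE if all dates fit inside a window of `span` days,
# treating the year as a circle of `year_length` days
fits_in_span <- function(dates, year_length = 366, span = year_length / 2) {
  n <- length(dates)
  days <- sort(yday(dates))

  # Double up the days so windows can wrap around into the next year
  doubled <- c(days, days + year_length)

  # Width of each window that starts at one birthday and covers all n of them
  widths <- doubled[seq_len(n) + n - 1] - doubled[seq_len(n)]
  any(widths <= span)
}

easy <- ymd(c("2024-01-04", "2024-03-10", "2024-04-28", "2024-05-21"))
tricky <- ymd(c("2024-01-04", "2024-03-10", "2024-04-28", "2024-12-01"))
nope <- ymd(c("2024-01-04", "2024-03-10", "2024-04-28", "2024-09-21"))

fits_in_span(easy)
## [1] TRUE
fits_in_span(tricky)
## [1] TRUE
fits_in_span(nope)
## [1] FALSE
```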

Actual simulation

Now that we have the ability to check if any set of dates fits within six months, we can generalize this to any household size. To make life a little easier, we’ll stop working with days of the year for now. Leap years are tricky, and the results changed a little bit above depending on whether the repeated year had 366 or 365 days. So instead, we’ll think about the 360° in a circle, since circles don’t suddenly have 361° every four years or anything weird like that.

In this simulation, we’ll generate n random numbers between 0 and 360 (where n is the household size we’re interested in). We’ll then do the doubling and sorting thing and check to see if any of the n-number spans covers 180° or less.

simulate_prob <- function(n, num_simulations = 1000) {
  results <- map_lgl(1:num_simulations, ~{
    birthdays <- runif(n, 0, 360)
    birthdays_doubled <- sort(c(birthdays, birthdays + 360))
    any(map_lgl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x] <= 180))
  })
  mean(results)
}

Here’s the probability of seeing all the birthdays in a six-month span in a household of 4:

withr::with_seed(1234, {
  simulate_prob(4)
})
## [1] 0.522

About 50%!

What about a household of 6, like in Karl’s original post?

withr::with_seed(1234, {
  simulate_prob(6)
})
## [1] 0.189

About 19%!

We can get this more precise and consistent by boosting the number of simulations:

withr::with_seed(1234, {
  simulate_prob(6, 50000)
})
## [1] 0.1866

Now that this function is working, we can use it to simulate a bunch of possible household sizes, like from 2 to 10:

simulated_households <- tibble(household_size = 2:10) |>
  mutate(prob_in_arc = map_dbl(household_size, ~simulate_prob(.x, 10000))) |>
  mutate(nice_prob = scales::label_percent(accuracy = 0.1)(prob_in_arc))

ggplot(simulated_households, aes(x = factor(household_size), y = prob_in_arc)) +
  geom_pointrange(aes(ymin = 0, ymax = prob_in_arc), color = clrs[2]) +
  geom_text(aes(label = nice_prob), nudge_y = 0.07, size = 8, size.unit = "pt") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "Household size",
    y = "Probability",
    title = "Probability that all birthdays occur within \na single six-month span across household size",
    caption = "10,000 simulations"
  ) +
  theme_minimal(base_family = "Montserrat") +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(hjust = 0, color = "grey50"),
    axis.title.x = element_text(hjust = 0),
    axis.title.y = element_text(hjust = 1)
  )

But birthdays aren’t uniformly distributed!

We just answered Karl’s original question: “Suppose n points are uniformly distributed on a circle. What is the probability that they belong to a connected half circle”. There’s probably some official mathy combinatorial way to build a real formula to describe this pattern, but that’s too hard. Simulation gets us there.
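For what it’s worth, the uniform case does have a well-known closed-form answer: n points placed uniformly at random on a circle all land in some half circle with probability n / 2^(n − 1). The sketch below isn’t part of the original analysis. half_circle_prob() and simulate_half_circle() are hypothetical names, and the simulation is a base R restatement of the doubling approach so that the block runs on its own:

```r
# Closed-form probability that n uniform points share a half circle
half_circle_prob <- function(n) {
  n / 2^(n - 1)
}

half_circle_prob(c(4, 6))
## [1] 0.5000 0.1875

# Base R version of the doubling-and-sorting simulation (hypothetical helper)
simulate_half_circle <- function(n, num_simulations = 20000) {
  hits <- replicate(num_simulations, {
    points <- sort(runif(n, 0, 360))
    doubled <- c(points, points + 360)
    any(doubled[seq_len(n) + n - 1] - doubled[seq_len(n)] <= 180)
  })
  mean(hits)
}

set.seed(1234)
simulated <- simulate_half_circle(6)

# The two numbers should be close, but won't match exactly
c(simulated = simulated, exact = half_circle_prob(6))
```

The simulated values for households of 4 and 6 land right around these exact values, which is a nice sanity check that the simulation is doing what we want.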

Uneven birthday distributions

But in real life, birthdays aren’t actually uniformly distributed. There are some fascinating patterns in days of birth. Instead of drawing birthdays from a uniform distribution where every day is equally likely, let’s draw from the actual distribution. There’s no simple probability-math formula for this, so simulation is the practical way to do this kind of calculation.

The CDC and the Social Security Administration track the counts of daily births in the United States. In 2016, FiveThirtyEight published a story about patterns in daily birth frequencies and posted their CSV files on GitHub, so we’ll load their data and figure out daily probabilities of birthdays.

births_1994_1999 <- read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv"
) |>
  # Ignore 2000 and later, since the SSA data covers those years
  filter(year < 2000)

births_2000_2014 <- read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"
)

births_combined <- bind_rows(births_1994_1999, births_2000_2014) |>
  mutate(
    full_date = make_date(year = 2024, month = month, day = date_of_month),
    day_of_year = yday(full_date)
  ) |>
  mutate(
    month_cateogrical = month(full_date, label = TRUE, abbr = FALSE)
  )

glimpse(births_combined)
## Rows: 7,670
## Columns: 8
## $ year              <dbl> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, …
## $ month             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ date_of_month     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ day_of_week       <dbl> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, …
## $ births            <dbl> 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 7910, …
## $ full_date         <date> 2024-01-01, 2024-01-02, 2024-01-03, 2024-01-04, 2024-01-05, …
## $ day_of_year       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ month_cateogrical <ord> January, January, January, January, January, January, …

But first, just because this is one of my favorite graphs ever, let’s visualize the data! Here’s a heatmap showing the daily average births for 366 days:

avg_births_month_day <- births_combined |>
  group_by(month_cateogrical, date_of_month) |>
  summarize(avg_births = mean(births))

ggplot(
  avg_births_month_day,
  aes(x = factor(date_of_month), y = fct_rev(month_cateogrical), fill = avg_births)
) +
  geom_tile() +
  scale_fill_viridis_c(
    option = "rocket", labels = scales::label_comma(),
    guide = guide_colorbar(barwidth = 20, barheight = 0.5, position = "bottom")
  ) +
  labs(
    x = NULL, y = NULL,
    title = "Average births per day",
    subtitle = "1994–2014",
    fill = "Average births"
  ) +
  coord_equal() +
  theme_minimal(base_family = "Montserrat") +
  theme(
    legend.justification.bottom = "left",
    legend.title.position = "top",
    panel.grid = element_blank(),
    axis.title.x = element_text(hjust = 0)
  )

There are some really fascinating stories here!

Nobody wants to have babies during Christmas or New Year’s. Christmas Day, Christmas Eve, and New Year’s Day seem to have the lowest average births.

New Year’s Eve, Halloween, July 4, April 1,⁵ and the whole week of Thanksgiving⁶ also have really low averages.

The 13th of every month has slightly fewer births than average—the column at the 13th is really obvious here.

The days with the highest average counts are in mid-September, from the 9th to the 20th—except for September 11.

⁵ No one wants joke babies?
⁶ American Thanksgiving is the fourth Thursday of November, so the exact day of the month moves around each year.

With this data, we can calculate the daily probability of having a birthday:

prob_per_day <- births_combined |>
  group_by(day_of_year) |>
  summarize(total = sum(births)) |>
  mutate(prob = total / sum(total)) |>
  mutate(full_date = ymd("2024-01-01") + days(day_of_year - 1)) |>
  mutate(yearless_date = format(full_date, "%B %d"))

glimpse(prob_per_day)
## Rows: 366
## Columns: 5
## $ day_of_year   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ total         <dbl> 164362, 196481, 228259, 232354, 230826, 229776, 230233, 223766, …
## $ prob          <dbl> 0.0019176, 0.0022923, 0.0026631, 0.0027108, 0.0026930, 0.0026808, …
## $ full_date     <date> 2024-01-01, 2024-01-02, 2024-01-03, 2024-01-04, 2024-01-05, …
## $ yearless_date <chr> "January 01", "January 02", "January 03", "January 04", …

Here are the 5 most common days:⁷

⁷ There’s me on September 19.

prob_per_day |>
  select(yearless_date, prob) |>
  slice_max(order_by = prob, n = 5)
## # A tibble: 5 × 2
##   yearless_date    prob
##   <chr>           <dbl>
## 1 September 09  0.00302
## 2 September 19  0.00301
## 3 September 12  0.00301
## 4 September 17  0.00299
## 5 September 10  0.00299

And the 5 least probable days—Leap Day, Christmas Day, New Year’s Day, Christmas Eve, and July 4th:

prob_per_day |>
  select(yearless_date, prob) |>
  slice_min(order_by = prob, n = 5)
## # A tibble: 5 × 2
##   yearless_date     prob
##   <chr>            <dbl>
## 1 February 29   0.000613
## 2 December 25   0.00162
## 3 January 01    0.00192
## 4 December 24   0.00199
## 5 July 04       0.00216

Using actual birthday probabilities

Instead of drawing random numbers between 0 and 360 from a uniform distribution, we can draw day-of-the-year numbers. This is easy with sample(). Here’s a random 4-person household:

withr::with_seed(1234, {
  sample(1:366, size = 4, replace = TRUE)
})
## [1] 284 336 101 111

That gives us a uniform probability distribution—all the numbers between 1 and 366 are equally likely. sample() has a prob argument that we can use to feed a vector of probabilities:

withr::with_seed(1234, {
  sample(
    prob_per_day$day_of_year,
    size = 4,
    replace = TRUE,
    prob = prob_per_day$prob)
})
## [1] 284 101 111 133

These days of the year now match the actual distribution of birthdays in the United States. If we simulated thousands of birthdays, we’d get more in September, fewer on the 13th of each month, and far fewer around Thanksgiving and Christmas.
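To see that the prob argument really does shape the draws, here’s a toy sketch with made-up probabilities instead of the real birthday data. The probs vector and its three labels are purely hypothetical; with enough draws, the observed shares should land close to the weights we fed in:

```r
# Made-up weights for three outcomes (not real birthday data)
probs <- c(common = 0.7, occasional = 0.2, rare = 0.1)

set.seed(1234)
draws <- sample(names(probs), size = 100000, replace = TRUE, prob = probs)

# Share of draws for each outcome, in the same order as probs
observed <- prop.table(table(draws))[names(probs)]
round(observed, 2)
```

The same logic applies to the 366 daily probabilities: days with higher weights get drawn more often.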

We can now update our simulation to use this more realistic distribution of birthdays:

simulate_prob_real <- function(n, num_simulations = 1000) {
  results <- map_lgl(1:num_simulations, ~{
    birthdays <- sample(
      prob_per_day$day_of_year,
      size = n,
      replace = TRUE,
      prob = prob_per_day$prob
    )
    birthdays_doubled <- sort(c(birthdays, birthdays + 366))
    any(map_lgl(1:n, ~ birthdays_doubled[.x + n - 1] - birthdays_doubled[.x] <= (366 / 2)))
  })
  mean(results)
}

Here’s the probability of having all the birthdays within the same six months for a household of 4:

withr::with_seed(1234, {
  simulate_prob_real(4)
})
## [1] 0.495

And 6:

withr::with_seed(1234, {
  simulate_prob_real(6)
})
## [1] 0.202

And here’s the probability across different household sizes:

sims_real <- tibble(household_size = 2:10) |>
  mutate(prob_in_arc = map_dbl(household_size, ~simulate_prob_real(.x, 10000))) |>
  mutate(nice_prob = scales::label_percent(accuracy = 0.1)(prob_in_arc))

ggplot(sims_real, aes(x = factor(household_size), y = prob_in_arc)) +
  geom_pointrange(aes(ymin = 0, ymax = prob_in_arc), color = clrs[9]) +
  geom_text(aes(label = nice_prob), nudge_y = 0.07, size = 8, size.unit = "pt") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "Household size",
    y = "Probability",
    title = "Probability that all birthdays occur within a\nsingle 6-month span across household size",
    subtitle = "Based on average daily birth probabilities from 1994–2014",
    caption = "10,000 simulations; daily probabilities from the CDC and SSA"
  ) +
  theme_minimal(base_family = "Montserrat") +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(hjust = 0, color = "grey50"),
    plot.subtitle = element_text(hjust = 0, color = "grey50"),
    axis.title.x = element_text(hjust = 0),
    axis.title.y = element_text(hjust = 1)
  )

In the end, these are all roughly the same as the uniform birthday distribution, but it feels more accurate since the probabilities are based on real-life frequencies.

But most importantly, we didn’t have to do any math to get the right answer. Brute force simulation techniques got us there.

Citation
BibTeX citation:
@online{heiss2024,
  author = {Heiss, Andrew},
  title = {Calculating Birthday Probabilities with {R} Instead of Math},
  date = {2024-05-03},
  url = {https://www.andrewheiss.com/blog/2024/05/03/birthday-spans-simulation-sans-math/},
  doi = {10.59350/r419r-zqj73},
  langid = {en}
}
For attribution, please cite this work as:
Heiss, Andrew. 2024. “Calculating Birthday Probabilities with R Instead of Math.” May 3, 2024. https://doi.org/10.59350/r419r-zqj73.

Visualizing {dplyr}’s mutate(), summarize(), group_by(), and ungroup() with animations

Andrew Heiss — Thu, 04 Apr 2024 04:00:00 GMT

I’ve used Garrick Aden-Buie’s tidyexplain animations since he first made them in 2018. They’re incredibly useful for teaching—being able to see which rows left_join() includes when merging two datasets, or which cells end up where when pivoting longer or pivoting wider is so valuable. Check them all out—they’re so fantastic:

left_join() animation by Garrick Aden-Buie

One set of animations that I’ve always wished existed but doesn’t is how {dplyr}’s mutate(), summarize(), group_by(), and ungroup() work. Unlike other more straightforward {dplyr} functions like filter() and select(), these mutating/summarizing/grouping functions often involve multiple behind-the-scenes steps that are hard to see. There’s even an official term for this kind of workflow: split/apply/combine.

When I teach about group_by() |> summarize(), I end up waving my arms around a lot to explain how group_by() puts rows into smaller, invisible datasets behind the scenes. This works, I guess, but I still find that it can be hard for people to conceptualize. It gets even trickier when explaining how {dplyr} keeps some grouping structures intact after summarizing and what exactly ungroup() does.

So, I finally buckled down and made my own tidyexplain-esque animations with Adobe Illustrator and After Effects.¹

¹ I tried doing it with R and {gganimate} like the original tidyexplain animations, but it was too hard to do with all the multiple grouping, summarizing, and recombining steps—so these are all artisanally handcrafted animations.

Downloads

You can download versions of all seven animations here:

mutate(): MP4, GIF, static PDF, static SVG, static PNG

summarize(): MP4, GIF, static PDF, static SVG, static PNG

group_by() |> ungroup(): MP4, GIF, static PDF, static SVG, static PNG

group_by() |> mutate(): MP4, GIF, static PDF, static SVG, static PNG

group_by(cat1) |> summarize(): MP4, GIF, static PDF, static SVG, static PNG

group_by(cat2) |> summarize(): MP4, GIF, static PDF, static SVG, static PNG

group_by(cat1, cat2) |> summarize(): MP4, GIF, static PDF, static SVG, static PNG

And for fun, here are all the original files:

Original Illustrator files

Original After Effects files

They’re Creative Commons-licensed—do whatever you want with them!

In this post, we’ll use these animations to explain each of these concepts and apply them to data from {palmerpenguins}. Let’s load some packages and data first:

library(tidyverse)
library(palmerpenguins)

penguins <- penguins |> drop_na()

Adding new columns with mutate()

The mutate() function in {dplyr} adds new columns. It’s not destructive: all our existing data will still be there after we add new columns.²

² Unless we use an existing column name inside mutate(), in which case that column will get replaced with the new one.

By default, mutate() sticks the new column on the far right of the dataset (scroll over to the right to see body_mass_kg here):

penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)
## # A tibble: 333 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year body_mass_kg
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>        <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007         3.75
##  2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007         3.8
##  3 Adelie  Torgersen           40.3          18                 195        3250 female  2007         3.25
##  4 Adelie  Torgersen           36.7          19.3               193        3450 female  2007         3.45
##  5 Adelie  Torgersen           39.3          20.6               190        3650 male    2007         3.65
##  6 Adelie  Torgersen           38.9          17.8               181        3625 female  2007         3.62
##  7 Adelie  Torgersen           39.2          19.6               195        4675 male    2007         4.68
##  8 Adelie  Torgersen           41.1          17.6               182        3200 female  2007         3.2
##  9 Adelie  Torgersen           38.6          21.2               191        3800 male    2007         3.8
## 10 Adelie  Torgersen           34.6          21.1               198        4400 male    2007         4.4
## # ℹ 323 more rows

We can also control where the new column shows up with either the .before or .after argument:

penguins |>
  mutate(
    body_mass_kg = body_mass_g / 1000,
    .after = island
  )
## # A tibble: 333 × 9
##    species island    body_mass_kg bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>            <dbl>          <dbl>         <dbl>             <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen         3.75           39.1          18.7               181        3750 male    2007
##  2 Adelie  Torgersen         3.8            39.5          17.4               186        3800 female  2007
##  3 Adelie  Torgersen         3.25           40.3          18                 195        3250 female  2007
##  4 Adelie  Torgersen         3.45           36.7          19.3               193        3450 female  2007
##  5 Adelie  Torgersen         3.65           39.3          20.6               190        3650 male    2007
##  6 Adelie  Torgersen         3.62           38.9          17.8               181        3625 female  2007
##  7 Adelie  Torgersen         4.68           39.2          19.6               195        4675 male    2007
##  8 Adelie  Torgersen         3.2            41.1          17.6               182        3200 female  2007
##  9 Adelie  Torgersen         3.8            38.6          21.2               191        3800 male    2007
## 10 Adelie  Torgersen         4.4            34.6          21.1               198        4400 male    2007
## # ℹ 323 more rows

Summarizing with summarize()

The summarize() function, on the other hand, is destructive. It collapses our dataset into a single row and throws away any columns that we don’t use when summarizing.

After using summarize() on the penguins data, we only see three values in one row: average bill length, total penguin weight, and the number of penguins in the dataset. All other columns are gone.

penguins |>
  summarize(
    avg_bill_length = mean(bill_length_mm),
    total_weight = sum(body_mass_g),
    n_penguins = n()  # This returns the number of rows in the dataset
  )
## # A tibble: 1 × 3
##   avg_bill_length total_weight n_penguins
##             <dbl>        <int>      <int>
## 1            44.0      1400950        333

Grouping and ungrouping with group_by() and ungroup()

The group_by() function splits a dataset into smaller subsets based on the values of columns that we specify. Importantly, this splitting happens behind the scenes—you don’t actually ever see the data split up into smaller datasets.³ To undo the grouping and bring all the rows back together, use ungroup().

³ I like to imagine that the data is splitting into smaller groups, Minority Report-style, or like Tony Stark’s JARVIS-enabled HUD.

Importantly, grouping doesn’t actually change the order of the rows in the dataset. If we use group_by() and look at our dataset, it’ll still be in the existing order. The only sign that the data is invisibly grouped is a little Groups: sex [2] note at the top of the output.

penguins |>
  group_by(sex)
## # A tibble: 333 × 8
## # Groups:   sex [2]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
##  2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
##  3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
##  4 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
##  5 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
##  6 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
##  7 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
##  8 Adelie  Torgersen           41.1          17.6               182        3200 female  2007
##  9 Adelie  Torgersen           38.6          21.2               191        3800 male    2007
## 10 Adelie  Torgersen           34.6          21.1               198        4400 male    2007
## # ℹ 323 more rows

Grouping is fairly useless on its own, but it becomes really powerful when combined with mutate() or summarize().

Mutating within groups

If we use mutate() after grouping, new columns are added to each subset separately. In many cases, you won’t notice any difference between using mutate() on an ungrouped or grouped dataset—you’ll get the same values. For instance, if we use mutate(body_mass_kg = body_mass_g / 1000) on an ungrouped dataset, R will create a column for the whole dataset that divides body_mass_g by 1,000; if we use mutate(body_mass_kg = body_mass_g / 1000) on a grouped dataset, R will create a new column within each of the subsets. Both approaches will generate the same values.⁴

⁴ Using mutate() on the grouped dataset will be a tiiiiiny bit slower because it’s actually running mutate() on each of the groups.

Grouping does matter, though, if we’re referencing other values within the group. In the example above, we created a new column y that subtracted the smallest value of x from each value of x. When running mutate(y = x - min(x)) on the ungrouped dataset, the smallest value of x is 1, so all the numbers decrease by 1. When running mutate(y = x - min(x)) on a grouped dataset, though, min(x) refers to the smallest value of x within each of the subsets. Check out this example here: the minimum values in groups A, B, and C are 1, 4, and 7 respectively, so in subset A we subtract 1 from all the values of x, in subset B we subtract 4 from all the values of x, and in subset C we subtract 7 from all the values of x. As a result, the new y column contains 0, 1, and 2 in each of the groups:
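The animation isn’t reproduced here, so here is a minimal sketch of the same idea. It assumes a made-up nine-row dataset; the `toy` name and its `grp` and `x` columns are hypothetical stand-ins for the animated example.

```r
library(dplyr)

# Hypothetical toy data: three groups (A, B, C) with three values each
toy <- tibble(
  grp = rep(c("A", "B", "C"), each = 3),
  x = 1:9
)

# Ungrouped: min(x) is the minimum of the whole column (1)
ungrouped <- toy |>
  mutate(y = x - min(x))

# Grouped: min(x) is found separately within A, B, and C (1, 4, and 7)
grouped <- toy |>
  group_by(grp) |>
  mutate(y = x - min(x)) |>
  ungroup()

ungrouped$y  # 0 through 8
grouped$y    # 0, 1, 2 repeated for each group
```

The only difference between the two pipelines is the group_by() call, which changes what min(x) refers to.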

Panel data (or time-series cross-sectional data, like the gapminder dataset) is a good example of a situation where grouping and mutating is important. For example, we can use lag() to create a new column (lifeExp_previous) that shows the previous year’s life expectancy.⁵

⁵ This is super common with models where you use time-shifted variables, like predicting an outcome based on covariates in the previous year.

library(gapminder)

gapminder_smaller <- gapminder |>
  filter(year %in% c(1997, 2002, 2007))  # Only show a few years

gapminder_smaller |>
  mutate(lifeExp_previous = lag(lifeExp), .after = lifeExp)
## # A tibble: 426 × 7
##    country     continent  year lifeExp lifeExp_previous      pop gdpPercap
##  1 Afghanistan Asia       1997    41.8             NA   22227415      635.
##  2 Afghanistan Asia       2002    42.1             41.8 25268405      727.
##  3 Afghanistan Asia       2007    43.8             42.1 31889923      975.
##  4 Albania     Europe     1997    73.0             43.8  3428038     3193.
##  5 Albania     Europe     2002    75.7             73.0  3508512     4604.
##  6 Albania     Europe     2007    76.4             75.7  3600523     5937.
##  7 Algeria     Africa     1997    69.2             76.4 29072015     4797.
##  8 Algeria     Africa     2002    71.0             69.2 31287142     5288.
##  9 Algeria     Africa     2007    72.3             71.0 33333216     6223.
## 10 Angola      Africa     1997    41.0             72.3  9875024     2277.
## # ℹ 416 more rows

Afghanistan in 1997 has a lagged life expectancy of NA, but that’s fine and to be expected—there’s no row for it to look at and copy the value (i.e. there’s no Afghanistan 1992 row). Afghanistan’s lagged life expectancy in 2002 is the same value as the actual life expectancy in 1997. Great, it worked!⁶

⁶ Technically this isn’t a one-year lag; this is a five-year lag, since the data is spaced every 5 years.

But look at Albania’s lagged life expectancy in 1997: it’s 43.8, which is actually Afghanistan’s 2007 life expectancy! Lagged values bleed across countries here.

If we group the data by country before lagging, though, the lagging happens within each of the subsets, so the first year of every country is missing (since there’s no previous year to look at). Now every country’s 1997 value is NA, since the new column was created separately in each of the smaller behind-the-scenes country-specific datasets:

gapminder_smaller |>
  group_by(country) |>
  mutate(lifeExp_previous = lag(lifeExp), .after = lifeExp)
## # A tibble: 426 × 7
## # Groups:   country [142]
##    country     continent  year lifeExp lifeExp_previous      pop gdpPercap
##  1 Afghanistan Asia       1997    41.8             NA   22227415      635.
##  2 Afghanistan Asia       2002    42.1             41.8 25268405      727.
##  3 Afghanistan Asia       2007    43.8             42.1 31889923      975.
##  4 Albania     Europe     1997    73.0             NA    3428038     3193.
##  5 Albania     Europe     2002    75.7             73.0  3508512     4604.
##  6 Albania     Europe     2007    76.4             75.7  3600523     5937.
##  7 Algeria     Africa     1997    69.2             NA   29072015     4797.
##  8 Algeria     Africa     2002    71.0             69.2 31287142     5288.
##  9 Algeria     Africa     2007    72.3             71.0 33333216     6223.
## 10 Angola      Africa     1997    41.0             NA    9875024     2277.
## # ℹ 416 more rows

Summarizing groups with group_by() |> summarize()

While collapsing an entire dataset can be helpful for finding overall summary statistics (e.g. the average, minimum, and maximum values for columns you’re interested in), summarize() is better used with groups. If we use summarize() on a grouped dataset, each subset is collapsed into a single row. This will create different summary values, depending on the groups you use. In this example, grouping by cat1 gives us a summarized dataset with three rows (for a, b, and c):

While here, if we group by cat2, we get a summarized dataset with two rows (for j and k):
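Since the animations aren’t shown here, both cases can be sketched with a small made-up dataset. The `toy` tibble and its `x` column are hypothetical; only the `cat1` and `cat2` column names come from the example above.

```r
library(dplyr)

# Hypothetical toy data with two categorical columns
toy <- tibble(
  cat1 = c("a", "a", "b", "b", "c", "c"),
  cat2 = c("j", "k", "j", "k", "j", "k"),
  x = 1:6
)

# Grouping by cat1 collapses the data to one row each for a, b, and c
by_cat1 <- toy |>
  group_by(cat1) |>
  summarize(total = sum(x))

# Grouping by cat2 collapses the data to one row each for j and k
by_cat2 <- toy |>
  group_by(cat2) |>
  summarize(total = sum(x))

by_cat1
by_cat2
```

The same six rows produce a three-row or a two-row summary depending only on which column defines the groups.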

If we use group_by() before summarizing the penguins data, we’ll get a column for the group, along with average bill length, total penguin weight, and the number of penguins in each group. As before, all other columns are gone.

We can see summarized values by species:

penguins |>
  group_by(species) |>
  summarize(
    avg_bill_length = mean(bill_length_mm),
    total_weight = sum(body_mass_g),
    n_penguins = n()  # This returns the number of rows in each group
  )
## # A tibble: 3 × 4
##   species   avg_bill_length total_weight n_penguins
## 1 Adelie               38.8       541100        146
## 2 Chinstrap            48.8       253850         68
## 3 Gentoo               47.6       606000        119

…or by sex…

penguins |>
  group_by(sex) |>
  summarize(
    avg_bill_length = mean(bill_length_mm),
    total_weight = sum(body_mass_g),
    n_penguins = n()
  )
## # A tibble: 2 × 4
##   sex    avg_bill_length total_weight n_penguins
## 1 female            42.1       637275        165
## 2 male              45.9       763675        168

…or by any other column.

Grouping by numeric columns

One common mistake is to feed a numeric column into group_by(), like this:

penguins |>
  group_by(flipper_length_mm) |>
  summarize(
    avg_bill_length = mean(bill_length_mm),
    total_weight = sum(body_mass_g),
    n_penguins = n()
  )
## # A tibble: 54 × 4
##    flipper_length_mm avg_bill_length total_weight n_penguins
##  1               172            37.9         3150          1
##  2               174            37.8         3400          1
##  3               176            40.2         3450          1
##  4               178            39.0        13300          4
##  5               180            39.8        14900          4
##  6               181            41.5        24000          7
##  7               182            39.6         9775          3
##  8               183            39.2         6625          2
##  9               184            37.9        25650          7
## 10               185            38.0        31550          9
## # ℹ 44 more rows

This technically calculates something, but it’s generally not what you’re looking for. R is making groups for each of the unique values of flipper length and then calculating summaries for those groups. There’s only one penguin with a flipper length of 172 mm; there are 7 with 181 mm. Grouping by a numeric variable can be useful if you want to create a histogram-like table of counts of unique values, but most of the time, you don’t want to do this.

Summarizing multiple groups

We can specify more than one group with group_by(), which will create behind-the-scenes datasets for each unique combination of values in the groups. Here, when we group by both cat1 and cat2, we get six groups (a & j, a & k, b & j, b & k, c & j, c & k), which we can then use with mutate() or summarize():
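As a rough sketch of those six groups, assuming a made-up twelve-row dataset (the `toy` tibble and its `x` column are hypothetical), n_groups() reports how many behind-the-scenes subsets exist before anything is summarized:

```r
library(dplyr)

# Hypothetical toy data: every combination of cat1 and cat2 appears twice
toy <- tibble(
  cat1 = rep(c("a", "b", "c"), each = 4),
  cat2 = rep(c("j", "k"), times = 6),
  x = 1:12
)

grouped <- toy |>
  group_by(cat1, cat2)

# Number of unique cat1 and cat2 combinations
n_groups(grouped)

# One summary row per combination; .groups = "drop" returns ungrouped
# output (the .groups argument is covered in the next section)
both <- grouped |>
  summarize(total = sum(x), .groups = "drop")

both
```

Three values of cat1 crossed with two values of cat2 gives six subsets, so the summarized data has six rows.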

Leftover groupings and ungroup()

Some subtle and interesting things happen when summarizing with multiple groups, though, and they throw people off all the time.

When you use summarize() on a grouped dataset, {dplyr} will automatically ungroup the last of the groups. This happens invisibly when you’re only grouping by one thing. For example, this has three rows, and no Groups: species [3] note at the top:

penguins |>
  group_by(species) |>
  summarize(total = n())
## # A tibble: 3 × 2
##   species   total
## 1 Adelie      146
## 2 Chinstrap    68
## 3 Gentoo      119

When grouping by multiple things, {dplyr} will automatically ungroup the last of the groups (i.e. the right-most group), but keep everything else grouped. This has six rows and is grouped by species (hence the Groups: species [3]), and R gives you an extra message alerting you to the fact that it’s still grouped by something: `summarise()` has grouped output by 'species'.

penguins |>
  group_by(species, sex) |>
  summarize(total = n())
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   species [3]
##   species   sex    total
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Chinstrap female    34
## 4 Chinstrap male      34
## 5 Gentoo    female    58
## 6 Gentoo    male      61

The same thing happens in reverse if we switch species and sex. The results here are still grouped by sex:

penguins |>
  group_by(sex, species) |>
  summarize(total = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   sex [2]
##   sex    species   total
## 1 female Adelie       73
## 2 female Chinstrap    34
## 3 female Gentoo       58
## 4 male   Adelie       73
## 5 male   Chinstrap    34
## 6 male   Gentoo       61

We can use ungroup() to bring the data all the way back together and get rid of the groups:

penguins |>
  group_by(species, sex) |>
  summarize(total = n()) |>
  ungroup()
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
##   species   sex    total
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Chinstrap female    34
## 4 Chinstrap male      34
## 5 Gentoo    female    58
## 6 Gentoo    male      61

Alternatively, summarize() has a .groups argument that you can use to control what happens to the groups after you summarize. By default, it uses .groups = "drop_last" and gets rid of the right-most group, but you can also drop all the groups (.groups = "drop") or keep all the groups (.groups = "keep"). See? No groups!

penguins |>
  group_by(species, sex) |>
  summarize(total = n(), .groups = "drop")
## # A tibble: 6 × 3
##   species   sex    total
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Chinstrap female    34
## 4 Chinstrap male      34
## 5 Gentoo    female    58
## 6 Gentoo    male      61
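To compare the three .groups options side by side, here is a small sketch using {dplyr}’s group_vars(), which lists the columns a dataset is still grouped by. It assumes a made-up four-row `toy` tibble standing in for penguins so that it runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex = c("female", "male", "female", "male")
)

grouped <- toy |>
  group_by(species, sex)

drop_last <- grouped |> summarize(total = n(), .groups = "drop_last")
keep <- grouped |> summarize(total = n(), .groups = "keep")
dropped <- grouped |> summarize(total = n(), .groups = "drop")

group_vars(drop_last)  # "species"
group_vars(keep)       # "species" "sex"
group_vars(dropped)    # character(0)
```

Setting .groups explicitly also silences the message about grouped output.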

Experimental different way of grouping and summarizing

With newer versions of {dplyr} (1.1.0 and later) there’s a new experimental way to specify groups when summarizing, borrowed from {data.table}. Rather than specify groups in an explicit group_by() function, you can do it inside summarize() with the .by argument:

penguins |>
  summarize(total = n(), .by = c(species, sex))
## # A tibble: 6 × 3
##   species   sex    total
## 1 Adelie    male      73
## 2 Adelie    female    73
## 3 Gentoo    female    58
## 4 Gentoo    male      61
## 5 Chinstrap female    34
## 6 Chinstrap male      34

This automatically ungroups everything when it’s done, so you don’t have any leftover groupings.
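As a quick check, here is a sketch with a made-up `toy` tibble (a hypothetical stand-in for penguins) that uses is_grouped_df() to confirm there are no leftover groups. It also shows one visible difference from group_by(): with .by, rows come back in the order the groups first appear in the data instead of being sorted, which is why the output above starts with male Adélies.

```r
library(dplyr)  # .by needs {dplyr} 1.1.0 or later

# Hypothetical toy data, deliberately not sorted by species
toy <- tibble(
  species = c("Gentoo", "Adelie", "Gentoo", "Adelie", "Gentoo"),
  sex = c("male", "female", "female", "male", "male")
)

with_by <- toy |>
  summarize(total = n(), .by = c(species, sex))

with_group_by <- toy |>
  group_by(species, sex) |>
  summarize(total = n(), .groups = "drop")

is_grouped_df(with_by)  # FALSE: nothing is left grouped

with_by$species        # order of first appearance
with_group_by$species  # sorted by the grouping columns
```

Both versions contain the same counts; only the row order differs.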

Why care about leftover groups?

Lots of the time, you don’t actually need to worry about leftover groupings. If you’re plotting or modeling or doing other stuff with the data, those functions will ignore the groups and work on the whole dataset. For example, I do stuff like calculating and plotting group summaries all the time—plot_data here is still grouped by species after summarizing, but ggplot() doesn’t care:

plot_data <- penguins |>
  group_by(species, sex) |>
  summarize(total = n())
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.

# plot_data is grouped by species, but that doesn't matter here
ggplot(plot_data, aes(x = species, y = total, fill = species)) +
  geom_col() +
  guides(fill = "none") +
  facet_wrap(vars(sex))

Leftover groups are very important when you use things like mutate() on the summarized dataset.

Like here, we’ll create a proportion column based on total / sum(total). Because we only grouped by one thing, there are no leftover groupings, so the prop column adds up to 100%:

penguins |>
  group_by(species) |>
  summarize(total = n()) |>
  mutate(prop = total / sum(total))
## # A tibble: 3 × 3
##   species   total  prop
## 1 Adelie      146 0.438
## 2 Chinstrap    68 0.204
## 3 Gentoo      119 0.357

Next, we’ll group by two things, which creates behind-the-scenes datasets for all the six combinations of species and sex. When {dplyr} is done, it ungroups the sex group, but leaves the dataset grouped by species. The prop column no longer adds up to 100%; it adds to 300%. That’s because it calculated total/sum(total) within each species group (so 50% of Adélies are female, 50% are male, etc.)

penguins |>
  group_by(species, sex) |>
  summarize(total = n()) |>
  mutate(prop = total / sum(total))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex    total  prop
## 1 Adelie    female    73 0.5
## 2 Adelie    male      73 0.5
## 3 Chinstrap female    34 0.5
## 4 Chinstrap male      34 0.5
## 5 Gentoo    female    58 0.487
## 6 Gentoo    male      61 0.513

If we reverse the grouping order so that sex comes first, {dplyr} will automatically stop grouping by species and keep the dataset grouped by sex. That means mutate() will work within each sex group, so the prop column here adds to 200%. 44% of female penguins are Adélies, 21% of female penguins are Chinstraps, and 35% of female penguins are Gentoos, and so on.

penguins |>
  group_by(sex, species) |>
  summarize(total = n()) |>
  mutate(prop = total / sum(total))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   sex [2]
##   sex    species   total  prop
## 1 female Adelie       73 0.442
## 2 female Chinstrap    34 0.206
## 3 female Gentoo       58 0.352
## 4 male   Adelie       73 0.435
## 5 male   Chinstrap    34 0.202
## 6 male   Gentoo       61 0.363

If we explicitly ungroup before calculating the proportion,⁷ then mutate() will work on the whole dataset instead of sex- or species-specific groups. Here, 22% of all penguins are female Adélies, 10% are female Chinstraps, etc.

⁷ Or use the .groups argument or .by argument in summarize()

penguins |>
  group_by(sex, species) |>
  summarize(total = n()) |>
  ungroup() |>
  mutate(prop = total / sum(total))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
##   sex    species   total  prop
## 1 female Adelie       73 0.219
## 2 female Chinstrap    34 0.102
## 3 female Gentoo       58 0.174
## 4 male   Adelie       73 0.219
## 5 male   Chinstrap    34 0.102
## 6 male   Gentoo       61 0.183
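The two alternatives mentioned in the footnote can be sketched the same way. This uses a made-up five-row `toy` tibble (hypothetical, so the example runs without the penguins data); in both versions mutate() sees ungrouped data, so the proportions are shares of the whole dataset.

```r
library(dplyr)  # .by needs {dplyr} 1.1.0 or later

# Hypothetical toy data standing in for penguins
toy <- tibble(
  sex = c("female", "female", "male", "male", "male"),
  species = c("Adelie", "Gentoo", "Adelie", "Adelie", "Gentoo")
)

# Option 1: drop every group as part of summarize()
via_groups <- toy |>
  group_by(sex, species) |>
  summarize(total = n(), .groups = "drop") |>
  mutate(prop = total / sum(total))

# Option 2: group inside summarize() with .by, which never leaves groups behind
via_by <- toy |>
  summarize(total = n(), .by = c(sex, species)) |>
  mutate(prop = total / sum(total))

sum(via_groups$prop)
sum(via_by$prop)
```

Either way the prop column adds up to 100%, and there is no grouping message to ignore.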

We don’t have to rely on {dplyr}’s automatic ungroup-the-last-grouping feature and we can add our own grouping explicitly later. Like here, {dplyr} stops grouping by sex, which means that the prop column would add to 300%, showing the proportion of sexes within each species. But if we throw in a group_by(sex) before mutate(), it’ll put everything in two behind-the-scenes datasets (male and female) and calculate the proportion of species within each sex. The resulting dataset is still grouped by sex, since mutate() doesn’t drop any groups like summarize():

penguins |>
  group_by(species, sex) |>
  summarize(total = n()) |>
  group_by(sex) |>
  mutate(prop = total / sum(total))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   sex [2]
##   species   sex    total  prop
## 1 Adelie    female    73 0.442
## 2 Adelie    male      73 0.435
## 3 Chinstrap female    34 0.206
## 4 Chinstrap male      34 0.202
## 5 Gentoo    female    58 0.352
## 6 Gentoo    male      61 0.363

Citation
BibTeX citation:
@online{heiss2024, author = {Heiss, Andrew}, title = {Visualizing \{Dplyr\}’s Mutate(), Summarize(), Group\_by(), and Ungroup() with Animations}, date = {2024-04-04}, url = {https://www.andrewheiss.com/blog/2024/04/04/group_by-summarize-ungroup-animations/}, doi = {10.59350/d2sz4-w4e25}, langid = {en} }
For attribution, please cite this work as:
Heiss, Andrew. 2024. “Visualizing {Dplyr}’s Mutate(), Summarize(), Group_by(), and Ungroup() with Animations.” April 4, 2024. https://doi.org/10.59350/d2sz4-w4e25.

Demystifying causal inference estimands: ATE, ATT, and ATU

Andrew Heiss — Thu, 21 Mar 2024 04:00:00 GMT
In my causal inference class, I spend just one week talking about the Rubin causal model and potential outcomes. This view of causality argues that for any kind of intervention (passing a new policy, participation in a nonprofit program, taking a specific kind of medicine, etc.), people will have one of two possible outcomes:

What would happen if they receive the intervention or treatment, and

What would happen if they do not receive the treatment

These two outcomes are potential outcomes. Both are plausible, but only one will happen in real life. These potential outcomes lead to a bunch of different causal estimands we might be interested in, like the average treatment effect.

I give such short shrift to potential outcomes largely because the bulk of the class approaches the idea of causal inference through Judea Pearl-style DAGs instead of potential outcomes. It’s a strange arrangement that I’ve stumbled into: the potential outcomes approach is incredibly popular and widespread in social sciences (particularly in economics), while causal models and DAGs are more popular in fields like epidemiology. For unfathomable reasons, there’s a weird animosity between these two worlds. Judea Pearl regularly needles social scientists on Twitter for not using DAGs and clinging to potential outcomes, while Nobel-winning econometricians decry DAGs. It’s weird.¹

¹ Though this neat paper by a DAG-using economist tries to bridge that gap (Huntington-Klein 2022).
As a social scientist myself, you’d think I’d have embraced the potential outcomes approach, but for whatever reason, it never stuck and it was always confusing to me. When I came across Judea Pearl’s The Book of Why (Pearl and Mackenzie 2020) a few years ago, I fell in love with the world of DAGs. They made sense—far more sense than the weird decompositional algebra behind average treatment effects, average treatment on the treated effects, average treatment on the untreated effects, and so on.

I’m not the only social science convert to DAGs. The general social science methods textbook Counterfactuals and Causal Inference (Morgan and Winship 2014) started popularizing DAGs in 2007, two modern phenomenal econometrics textbooks—The Effect (Huntington-Klein 2021) and Causal Inference: The Mixtape (Cunningham 2021)—feature DAGs throughout (despite that discipline’s weird aversion to them), and the latest version of the fantastic Bayesian Statistical Rethinking (McElreath 2020) uses them extensively.

Despite all these newer DAG-based approaches in social science, in my class, I never really revisit the potential outcomes framework after that one week. We do all sorts of causal effects estimation with DAG-based adjustment through matching and inverse probability weighting, and quasi-experimental design-based approaches like difference-in-differences, regression discontinuity, and instrumental variables, but beyond emphasizing the fact that methods like regression discontinuity and instrumental variables only return local average treatment effects, we don’t really ever talk about ATEs and ATTs and ATUs again.

This has always bugged me.

Beyond introducing the idea that we can’t find individual-level causal effects without a time machine, thinking about potential outcomes is neat, I guess, but not exactly relevant to all the other methods we cover. I know that’s wrong! But that’s how my mental model of these estimands has worked. All these other methods give some sort of general average causal effect, but I’m never sure which exact flavor of causal effect it is (or if the exact flavor matters).

But a newer working paper by Greifer and Stuart (2023) has finally helped me realize why these different estimands matter and what the subtle differences between them are.

So in this post, I’ll extend the basic standard ATE/ATT/ATU example to reflect a more realistic, larger dataset, and I’ll use propensity score weighting to estimate each estimand. I’ll also follow Greifer and Stuart’s example and translate these estimands into policy-relevant English equivalents (they use medical terminology; I’m not that kind of doctor and I work with social science and policy interventions).

But first, I highly recommend reading through their paper really quick. It’s not too long and it’s not too mathy; it’s succinct and accessible and a good primer for all these estimands.

(And before we start, let’s load some R packages.)

library(tidyverse)
library(ggtext)
library(ggdag)
library(dagitty)
library(gt)
library(broom)
library(marginaleffects)
library(WeightIt)

# Define a nice color palette from {MoMAColors}
# https://github.com/BlakeRMills/MoMAColors
clrs <- MoMAColors::moma.colors("ustwo")

# Download Mulish from https://fonts.google.com/specimen/Mulish
theme_nice <- function() {
  theme_minimal(base_family = "Mulish") +
    theme(
      panel.grid.minor = element_blank(),
      plot.background = element_rect(fill = "white", color = NA),
      plot.title = element_text(face = "bold"),
      axis.title = element_text(face = "bold"),
      strip.text = element_text(face = "bold"),
      strip.background = element_rect(fill = "grey80", color = NA),
      legend.title = element_text(face = "bold")
    )
}

theme_set(theme_nice())

update_geom_defaults("text", list(family = "Mulish", fontface = "plain"))
update_geom_defaults("label", list(family = "Mulish", fontface = "plain"))
update_geom_defaults(ggdag:::GeomDagText, list(family = "Mulish", fontface = "plain"))
update_geom_defaults(ggtext::GeomRichText, list(family = "Mulish", fontface = "plain"))

Quick crash course in potential outcomes

Before getting into the subtle differences between the various potential outcomes-related estimands, it’s helpful to get a general sense for how these things work. So let’s take a super abbreviated crash course in the potential outcomes framework.

Every causal inference textbook ever written will include a table like Table 1 to illustrate potential outcomes. To make this idea a little more mathy, we’ll call the treatment or intervention $X$, the outcome that would happen if treated $Y^1$, and the outcome that would happen if not treated $Y^0$.² We’ll use $\delta$ for the difference between $Y^1$ and $Y^0$, or the individual-level causal effect.

² The notation for all this varies wildly across disciplines. Economists call the treatment $D$ for mysterious reasons; epidemiologists will often call it $A$; I’ve seen political science papers call it $T$ (which at least makes more sense than $D$ or $A$, since “treatment” starts with T). In my class I call it $X$, which follows what a lot of other people do (like this guide to “10 Strategies for Figuring Out if X Caused Y”).
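As a minimal sketch of what such a table holds, here is a tiny made-up dataset of four hypothetical people. The numbers and column names (`y_treated`, `y_untreated`, `effect`, `y_observed`) are invented for illustration and are not the values from Table 1.

```r
library(dplyr)

# Made-up potential outcomes for four hypothetical people
potential <- tibble(
  person = 1:4,
  treated = c(TRUE, TRUE, FALSE, FALSE),
  y_treated = c(80, 75, 85, 70),   # outcome if treated
  y_untreated = c(60, 70, 80, 60)  # outcome if not treated
) |>
  mutate(
    # Individual-level causal effect: difference between the two potential outcomes
    effect = y_treated - y_untreated,
    # In real life we only ever observe one of the two potential outcomes
    y_observed = ifelse(treated, y_treated, y_untreated)
  )

potential

# With both potential outcomes known, the average treatment effect is
# simply the average of the individual-level effects
mean(potential$effect)
```

In real data the effect column can’t be calculated, since each person only has a value for y_observed; that missing half of the table is why the different estimands need to be estimated instead.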
