A guide to Bayesian proportion tests with R and {brms}

Andrew Heiss
May 15, 2023
https://www.andrewheiss.com/blog/2023/05/15/fancy-bayes-diffs-props/index.html

I’ve been working on converting a couple of my dissertation chapters into standalone articles, so I’ve been revisiting and improving my older R code. As part of my dissertation work, I ran a global survey of international NGOs to see how they adjust their programs and strategies when working under authoritarian legal restrictions in dictatorships. In the survey, I asked a bunch of questions with categorical responses (like “How would you characterize your organization’s relationship with your host-country government?”, with the possible responses “Positive,” “Negative,” and “Neither”).

Analyzing this kind of categorical data can be tricky. What I ended up doing (and what people typically do) is looking at the differences in proportions of responses (i.e. is there a substantial difference in the proportion reporting “Positive” or “Negative” in different subgroups?). But doing this requires a little bit of extra work because of how proportions work. We can’t just treat proportions like continuous values. Denominators matter and help determine the level of uncertainty in the proportions. We need to account for these disparate sample sizes underlying the proportions.
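To make that concrete, here’s a quick base R illustration (my own, not from the original analysis): two samples with the same observed proportion but very different denominators produce very different levels of uncertainty.

```r
# Same observed proportion (50%), very different denominators
small <- prop.test(5, 10)      # 5 successes out of 10 trials
large <- prop.test(500, 1000)  # 500 successes out of 1,000 trials

# The confidence interval from the small sample is far wider
diff(small$conf.int)
diff(large$conf.int)
```

Both tests estimate a proportion of 0.5, but only the larger sample pins it down precisely.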

Additionally, it’s tempting to treat Likert scale responses^{1} (i.e. things like “strongly agree”, “agree”, “neutral”, etc.) as numbers (e.g. make “strongly agree” a 2, “agree” a 1, “neutral” a 0, and so on), and then find things like averages (e.g. “the average response is a 1.34”), but those summary statistics aren’t actually that accurate. What would a 1.34 mean? A little higher than agree? What does that even mean?

^{1} Pronounced “lick-ert”!

We need to treat these kinds of survey questions as the categories that they are, which requires a different set of analytical tools than just finding averages and running linear regressions. We instead need to use things like frequencies, proportions, contingency tables and crosstabs, and fancier regression models like ordered logistic regression.
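A tiny made-up example (the coding scheme here is just for illustration) shows why averaging Likert codes hides information: two groups with wildly different response patterns can have identical means.

```r
# Hypothetical coded responses: "strongly agree" = 2 ... "strongly disagree" = -2
group_a <- c(-2, -2, 2, 2)  # Totally polarized responses
group_b <- c(0, 0, 0, 0)    # Everyone neutral

# Identical averages...
mean(group_a)  # 0
mean(group_b)  # 0

# ...but the frequency tables tell completely different stories
table(group_a)
table(group_b)
```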

I’m a fan of Bayesian statistical inference—I find it way more intuitive and straightforward than frequentist null hypothesis testing. I first “converted” to Bayesianism back when I first analyzed my dissertation data in 2017 and used the newly-invented {rstanarm} package to calculate the difference in proportions of my various survey responses in Bayesian ways, based on a blog post and package for Bayesian proportion tests by Rasmus Bååth. He used JAGS, though, and I prefer to use Stan, hence my use of {rstanarm} back then.

So, as a reference for myself while rewriting these chapters, and as a way to consolidate everything I’ve learned about Bayesian-flavored proportion tests, here’s a guide to thinking about differences in proportions in a principled Bayesian way. I explore two different questions (explained in detail below). For the first one I’ll be super pedagogical and long-winded, showing how to find differences in proportions with classical frequentist statistical tests, with different variations of Stan code, and with different variations of {brms} code. For the second one I’ll be less pedagogical and just show the code and results.

This guide assumes you’re familiar with {brms} for running Bayesian regression models, and with {tidybayes} and {ggdist} for manipulating and plotting posterior draws.

Wrangling and exploring the data

Before getting started, let’s load all the packages we need and create some helpful functions and variables:

Code

```r
library(tidyverse)   # ggplot, dplyr, and friends
library(gt)          # Fancy tables
library(glue)        # Easier string interpolation
library(scales)      # Nicer labeling functions
library(ggmosaic)    # Mosaic plots with ggplot
library(ggpattern)   # Pattern fills in ggplot
library(patchwork)   # Combine plots nicely
library(parameters)  # Extract model parameters as data frames
library(cmdstanr)    # Run Stan code from R
library(brms)        # Nice frontend for Stan
library(tidybayes)   # Manipulate MCMC chains in a tidy way
library(likert)      # Contains the pisaitems data

# Use the cmdstanr backend for brms because it's faster and more modern than the
# default rstan backend. You need to install the cmdstanr package first
# (https://mc-stan.org/cmdstanr/) and then run cmdstanr::install_cmdstan() to
# install cmdstan on your computer.
options(mc.cores = 4, brms.backend = "cmdstanr")

# Set some global Stan options
CHAINS <- 4
ITER <- 2000
WARMUP <- 1000
BAYES_SEED <- 1234

# Custom ggplot theme to make pretty plots
# Get the font at https://fonts.google.com/specimen/Jost
theme_nice <- function() {
  theme_minimal(base_family = "Jost") +
    theme(panel.grid.minor = element_blank(),
          plot.title = element_text(family = "Jost", face = "bold"),
          axis.title = element_text(family = "Jost Medium"),
          axis.title.x = element_text(hjust = 0),
          axis.title.y = element_text(hjust = 1),
          strip.text = element_text(family = "Jost", face = "bold",
                                    size = rel(1), hjust = 0),
          strip.background = element_rect(fill = NA, color = NA))
}

# Function for formatting axes as percentage points
label_pp <- label_number(accuracy = 1, scale = 100, suffix = " pp.",
                         style_negative = "minus")
```

Throughout this guide, we’ll use data from the 2009 Programme for International Student Assessment (PISA) survey, which is available in the {likert} package. PISA is administered every three years to 15-year-old students in dozens of countries, and it tracks academic performance, outcomes, and student attitudes cross-nationally.

One of the questions on the PISA student questionnaire (Q25) asks respondents how often they read different types of materials:

The excerpt of PISA data included in the {likert} package only includes responses from Canada, Mexico, and the United States, and for this question it seems to omit data from Canada, so we’ll compare the reading frequencies of American and Mexican students. Specifically we’ll look at how often these students read comic books and newspapers. My wife is currently finishing her master’s thesis on religious representation in graphic novels, and I’m a fan of comics in general, so it’ll be interesting to see how students in the two countries read these books. We’ll look at newspapers because it’s an interesting category with some helpful variation (reading trends for magazines, fiction, and nonfiction look basically the same in both countries, so we’ll ignore them here).

First we’ll load and clean the data. For the sake of illustration in this post, I collapse the five possible responses into just three:

Rarely = Never or almost never

Sometimes = A few times a year & About once a month

Often = Several times a month & Several times a week

In real life it’s typically a bad idea to collapse categories like this, and I’m not an education researcher so this collapsing is probably bad and wrong, but whatever—just go with it :)

Code

```r
# Load the data
data("pisaitems", package = "likert")

# Make a clean wide version of the data
reading_wide <- pisaitems %>%
  # Add ID column
  mutate(id = 1:n()) %>%
  # Only keep a few columns
  select(id, country = CNT, `Comic books` = ST25Q02, Newspapers = ST25Q05) %>%
  # Collapse these categories
  mutate(across(c(`Comic books`, Newspapers),
                ~fct_collapse(.,
                              "Rarely" = c("Never or almost never"),
                              "Sometimes" = c("A few times a year", "About once a month"),
                              "Often" = c("Several times a month", "Several times a week")))) %>%
  # Make sure the new categories are ordered correctly
  mutate(across(c(`Comic books`, Newspapers),
                ~fct_relevel(., c("Rarely", "Sometimes", "Often")))) %>%
  # Only keep the US and Mexico and get rid of the empty Canada level
  filter(country %in% c("United States", "Mexico")) %>%
  mutate(country = fct_drop(country))

head(reading_wide)
##      id country Comic books Newspapers
## 1 23208  Mexico   Sometimes      Often
## 2 23209  Mexico      Rarely      Often
## 3 23210  Mexico       Often      Often
## 4 23211  Mexico   Sometimes  Sometimes
## 5 23212  Mexico      Rarely      Often
## 6 23213  Mexico   Sometimes  Sometimes

# Make a tidy (long) version
reading <- reading_wide %>%
  pivot_longer(-c(id, country), names_to = "book_type", values_to = "frequency") %>%
  drop_na(frequency)

head(reading)
## # A tibble: 6 × 4
##      id country book_type   frequency
##   <int> <fct>   <chr>       <fct>
## 1 23208 Mexico  Comic books Sometimes
## 2 23208 Mexico  Newspapers  Often
## 3 23209 Mexico  Comic books Rarely
## 4 23209 Mexico  Newspapers  Often
## 5 23210 Mexico  Comic books Often
## 6 23210 Mexico  Newspapers  Often
```

Since we created a tidy (long) version of the data, we can use {dplyr} to create some overall group summaries of reading frequency by type of material across countries. Having data in this format, with a column for the specific total (i.e. Mexico + Comic books + Rarely, and so on, or n here) and the overall country total (or total here), allows us to calculate group proportions (n / total):

Code

```r
reading_counts <- reading %>%
  group_by(country, book_type, frequency) %>%
  summarize(n = n()) %>%
  group_by(country, book_type) %>%
  mutate(total = sum(n), prop = n / sum(n)) %>%
  ungroup()

reading_counts
## # A tibble: 12 × 6
##    country       book_type   frequency     n total  prop
##    <fct>         <chr>       <fct>     <int> <int> <dbl>
##  1 Mexico        Comic books Rarely     9897 37641 0.263
##  2 Mexico        Comic books Sometimes 17139 37641 0.455
##  3 Mexico        Comic books Often     10605 37641 0.282
##  4 Mexico        Newspapers  Rarely     6812 37794 0.180
##  5 Mexico        Newspapers  Sometimes 12369 37794 0.327
##  6 Mexico        Newspapers  Often     18613 37794 0.492
##  7 United States Comic books Rarely     3237  5142 0.630
##  8 United States Comic books Sometimes  1381  5142 0.269
##  9 United States Comic books Often       524  5142 0.102
## 10 United States Newspapers  Rarely     1307  5172 0.253
## 11 United States Newspapers  Sometimes  1925  5172 0.372
## 12 United States Newspapers  Often      1940  5172 0.375
```

And we can do a little pivoting and formatting and make this nice little contingency table / crosstab summarizing the data across the two countries:

This table has a lot of numbers and it’s hard to immediately see what’s going on, so we’ll visualize it really quick to get a sense for some of the initial trends. Visualizing this is a little tricky though. It’s easy enough to make a bar chart of the percentages, but doing that is a little misleading. Notice the huge differences in the numbers of respondents here—there are thousands of Mexican students and only hundreds of American students. We should probably visualize the data in a way that accounts for these different denominators, and percentages by themselves don’t do that.

We can use a mosaic plot (also known as a Marimekko chart) to visualize both the distribution of proportions of responses (Rarely, Sometimes, Often) and the proportions of respondents from each country (Mexico and the United States). We can use the {ggmosaic} package to create ggplot-flavored mosaic plots:

Base R works too!

Mosaic plots are actually also built into base R!

Try running plot(table(mtcars$cyl, mtcars$am)).

Code

```r
ggplot(reading) +
  geom_mosaic(aes(x = product(frequency, country),  # Look at combination of frequency by country
                  fill = frequency, alpha = country),
              divider = mosaic("v"),  # Make the plot divide vertically
              offset = 0,
              color = "white", linewidth = 1) +  # Add thin white border around boxes
  # Make the x-axis take up almost the full panel width, with 2% spacing on each side
  scale_x_productlist(expand = c(0, 0.02)) +
  # Make the y-axis take up the full panel width
  scale_y_productlist(expand = c(0, 0)) +
  scale_fill_manual(values = c("#E51B24", "#FFDE00", "#009DDC"), guide = "none") +
  scale_alpha_manual(values = c(0.6, 1), guide = "none") +
  facet_wrap(vars(book_type)) +
  theme_nice() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        # Using labs(x = NULL) with scale_x_productlist() apparently doesn't
        # work, so we can turn off the axis title with theme() instead
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
```

There are a few quick things to point out in this plot. The Mexico rectangles are super tall, since there are so many more respondents from Mexico than from the United States. It looks like the majority of American students only rarely read comic books, while their Mexican counterparts do it either sometimes or often. Mexican students also read newspapers a lot more than Americans do, with “Often” clearly the most common category. For Americans, the “Sometimes” and “Often” newspaper responses look about the same.

The questions

Throughout this post, I’ll illustrate the logic of Bayesian proportion tests by answering two questions that I had while looking at the contingency table and mosaic plot above:

Do students in Mexico read comic books more often than students in the United States? (the two yellow cells in the “Often” column)

Do students in the United States vary in their frequency of reading newspapers? (the blue row for newspapers in the United States)

Question 1: Comic book readership in the US vs. Mexico

Estimand

For our first question, we want to know if there’s a substantial difference in the proportion of students who read comic books often in the United States and Mexico, or whether the difference between the two yellow cells is greater than zero:

To be more formal about this, we’ll call this estimand θ, which is the difference between the two proportions π:

θ = π_Mexico − π_US

When visualizing this, we’ll use colors to help communicate the idea of between-group differences. We’ll get fancier with the colors in question 2, where we’ll look at three sets of pairwise differences, but here we’re just looking at a single pairwise difference (the difference between the US and Mexico), so we’ll use a bit of subtle and not-quite-correct color theory. In kindergarten we all learned the RYB color model, where primary pigments can be mixed to create secondary pigments. For instance

blue + yellow = green.

If we do some (wrong color-theory-wise) algebra, we can rearrange the formula so that

blue − green = yellow

If we make the United States blue and Mexico green, the ostensible color for their difference is yellow. This is TOTALLY WRONG and a lot more complicated according to actual color theory, but it’s a cute and subtle visual cue, so we’ll use it.

These primary colors are a little too bright for my taste though, so let’s artsy them up a bit. We’re looking at data about the US and Mexico, so we’ll use the Saguaro palette from the {NatParksPalettes} package, since Saguaro National Park is near the US-Mexico border and it has a nice blue, yellow, and green.

We’ll use yellow for the difference (θ) between the United States and Mexico.

In classical frequentist statistics, there are lots of ways to test for significant differences in counts, rates, proportions, and other categorical things, like chi-squared tests, Fisher’s exact test, or proportion tests. Each of these tests has special corresponding “flavors” that apply to specific conditions within the data being tested or the estimand being calculated (corrections for sample sizes, etc.). In standard stats classes, you memorize big flowcharts of possible statistical operations and select the correct one for the situation.

Since we want to test the difference between two group proportions, we’ll use R’s prop.test(), which tests the null hypothesis that the proportions in some number of groups are the same:

Our job with null hypothesis significance testing is to calculate a test statistic (χ² in this case) for θ, determine the probability of seeing that statistic in a world where θ is actually 0, and infer whether the value we see could plausibly fit in a world where the null hypothesis is true.

We need to feed prop.test() either a matrix with a column of counts of successes (students who read comic books often) and failures (students who do not read comic books often) or two separate vectors: one of counts of successes (students who read comic books often) and one of counts of totals (all students). We’ll do it both ways for fun.

First we’ll make a matrix of the counts of students from Mexico and the United States, with columns for the counts of those who read often and of those who don’t read often.
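The matrix-building code isn’t shown in this excerpt, but a sketch consistent with the counts in reading_counts above would look something like this (the exact construction and row/column names are my assumption):

```r
# Successes = "Often" responses; failures = everyone else
often_matrix <- matrix(
  c(10605, 37641 - 10605,  # Mexico: often, not often
    524,   5142 - 524),    # United States: often, not often
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Mexico", "United States"), c("Often", "Not often"))
)
often_matrix
```

The first column holds successes and the second holds failures, which is the layout prop.test() expects for matrix input.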

```r
prop.test(often_matrix)
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  often_matrix
## X-squared = 759, df = 1, p-value <2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.170 0.189
## sample estimates:
## prop 1 prop 2
##  0.282  0.102
```

This gives us some helpful information. The “flavor” of the formal test is a “2-sample test for equality of proportions with continuity correction”, which is fine, I guess.

We have proportions that are the same as what we have in the highlighted cells in the contingency table (28.17% / 10.19%) and we have a 95% confidence interval for the difference. Oddly, R doesn’t show the actual difference in these results, but we can see that difference if we use model_parameters() from the {parameters} package (which is the apparent successor to broom::tidy()?). Here we can see that the difference in proportions is 18 percentage points:

Code

```r
prop.test(often_matrix) %>%
  model_parameters()
## 2-sample test for equality of proportions
##
## Proportion      | Difference |       95% CI | Chi2(1) |      p
## --------------------------------------------------------------
## 28.17% / 10.19% |     17.98% | [0.17, 0.19] |  759.26 | < .001
##
## Alternative hypothesis: two.sided
```

And finally we have a test statistic: a χ² value of 759.265, which is huge and definitely statistically significant. In a hypothetical world where there’s no difference in the proportions, the probability of seeing a difference of at least 18 percentage points is super tiny (p < 0.001). We have enough evidence to confidently reject the null hypothesis and declare that the proportions of the groups are not the same. With the confidence interval, we can say that we are 95% confident that the interval 0.170–0.189 captures the true population parameter. We can’t say that there’s a 95% chance that the true value falls in this range—we can only talk about the range.

We can also do this without using a matrix by feeding prop.test() two vectors: one with counts of people who read comics often and one with counts of total respondents:

Code

```r
often_values <- reading_counts %>%
  filter(book_type == "Comic books", frequency == "Often")

(n_often <- as.numeric(c(often_values[1, 4], often_values[2, 4])))
## [1] 10605   524

(n_respondents <- as.numeric(c(often_values[1, 5], often_values[2, 5])))
## [1] 37641  5142

prop.test(n_often, n_respondents)
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  n_often out of n_respondents
## X-squared = 759, df = 1, p-value <2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.170 0.189
## sample estimates:
## prop 1 prop 2
##  0.282  0.102
```

The results are the same.

Bayesianly with Stan

Why Bayesian modeling?

I don’t like null hypotheses and I don’t like flowcharts.

Regular classical statistics classes teach about null hypotheses and flowcharts, but there’s a better way. In his magisterial Statistical Rethinking, Richard McElreath describes how, in legend, people created mythological clay robots called golems that could protect cities from attacks, but that could also spin out of control and wreak all sorts of havoc. McElreath uses the idea of golems as a metaphor for classical statistical models focused on null hypothesis significance testing, which also consist of powerful quasi-alchemical procedures that have to be followed precisely with specific flowcharts:

McElreath argues that this golem-based approach to statistics is incredibly limiting, since (1) you have to choose the right test, and if it doesn’t exist, you have to wait for some fancy statistician to invent it, and (2) you have to focus on rejecting null hypotheses instead of exploring research hypotheses.

For instance, in the null hypothesis framework section above, this was the actual question undergirding the analysis:

In a hypothetical world where θ = 0 (or where there’s no difference between the proportions), what’s the probability that this one-time collection of data fits in that world—and if the probability is low, is there enough evidence to confidently reject that hypothetical world of no difference?

oof. We set our prop.test() golem to work and got a p-value for the probability of seeing the 18 percentage point difference in a world where there’s actually no difference. That p-value was low, so we confidently declared θ to be statistically significant and not zero. But that was it. We rejected the null world. Yay. But that doesn’t say much about our main research hypothesis. Boo.

Our actual main question is far simpler:

Given the data here, what’s the probability that there’s no difference between the proportions, or that θ = 0?

Bayesian-flavored statistics lets us answer this question and avoid null hypotheses and convoluted inference. Instead of calculating the probability of seeing some data given a null hypothesis, or P(data | H0), we can use Bayesian inference to calculate the probability of a hypothesis given some data, or P(hypothesis | data).

Formal model

So instead of thinking about a specific statistical golem, we can think about modeling the data-generating process that could create the counts and proportions that we see in the PISA data.

Remember that our data looks like this, with n showing a count of the people who read comic books often in each country and total showing a count of the people who responded to the question.

Code

```r
reading_counts %>%
  filter(book_type == "Comic books", frequency == "Often")
## # A tibble: 2 × 6
##   country       book_type   frequency     n total  prop
##   <fct>         <chr>       <fct>     <int> <int> <dbl>
## 1 Mexico        Comic books Often     10605 37641 0.282
## 2 United States Comic books Often       524  5142 0.102
```

The actual process for generating that n involved asking thousands of individual, independent people if they read comic books often. If someone says yes, it’s counted as a “success” (or marked as “often”); if they say no, it’s not marked as “often”. It’s a binary choice repeated across thousands of independent questions, or “trials.” There’s also an underlying overall probability for reporting “often,” which corresponds to the proportion of people selecting it.
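We can simulate this data-generating process directly in base R: if the true probability of answering “often” were, say, 0.28, one hypothetical run of the survey is a single binomial draw (the specific numbers here just echo the Mexican comic book counts; this simulation is my own illustration).

```r
set.seed(1234)

# One simulated survey: 37,641 independent yes/no "trials", each with a 28%
# chance of a "success" (an "often" response)
simulated_n <- rbinom(1, size = 37641, prob = 0.28)

# With this many trials, the simulated proportion lands very close to the
# true probability
simulated_n / 37641
```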

The official statistical term for this kind of data-generating process (a bunch of independent trials with some probability of success) is a binomial distribution, written k ~ Binomial(n, π), with three parameters:

Number of successes (k): the number of people responding yes to “often,” or n in our data

Probability of success (π): the probability someone says yes to “often,” or the thing we want to model for each country

Number of trials (n): the total number of people asked the question, or total in our data

The number of successes and trials are just integers—they’re counts—and we already know those, since they’re in the data. The probability of success is a percentage and ranges somewhere between 0 and 1. We don’t know this value, but we can estimate it with Bayesian methods by defining a prior and a likelihood, churning through a bunch of Markov chain Monte Carlo (MCMC) simulations, and finding a posterior distribution of π.

We can use a Beta distribution to model π, since Beta distributions are naturally bound between 0 and 1 and they work well for probability-scale things. Beta distributions are defined by two parameters: (1) α, or shape1 in R, and (2) β, or shape2 in R. See this section of my blog post on zero-inflated Beta regression for way more details about how these parameters work and what they mean.

Super quickly, we’re interested in the probability of a “success” (or where “often” is yes), which is the number of “often”s divided by the total number of responses:

π = often / total

We can separate that denominator into two parts, the count of successes and the count of failures:

π = successes / (successes + failures)

The α and β parameters correspond to those counts of successes and failures:

π = α / (α + β)

With these two shape parameters, we can create any percentage or fraction we want. We can also control the uncertainty of the distribution by tinkering with the scale of the parameters. For instance, if we think there’s a 40% chance of something happening, this could be represented with α = 4 and β = 6, since 4 / (4 + 6) = 0.4. We could also write it as α = 40 and β = 60, since 40 / (40 + 60) = 0.4 too. Both are centered at 40%, but Beta(40, 60) is a lot narrower and less uncertain.
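We can check that scaling claim with base R’s qbeta() (a quick sketch, not from the original post): both distributions have the same mean, but the 95% interval of Beta(40, 60) is much tighter than that of Beta(4, 6).

```r
# 95% intervals for two Beta distributions centered at 0.4
wide   <- qbeta(c(0.025, 0.975), shape1 = 4,  shape2 = 6)
narrow <- qbeta(c(0.025, 0.975), shape1 = 40, shape2 = 60)

diff(wide)    # Wide interval
diff(narrow)  # Much narrower interval

# Same mean either way
4 / (4 + 6)
40 / (40 + 60)
```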

We’ve already seen the data and know that the proportion of students who read comic books often is 10ish% in the United States and 30ish% in Mexico, but we’ll cheat and say that we think that around 25% of students read comic books often, ± some big amount. This implies something like a Beta(2, 6) distribution (since 2 / (2 + 6) = 0.25), with lots of low possible values, but not a lot of πs above 50%. We could narrow this down by scaling up the parameters (like Beta(20, 60) or Beta(10, 40), etc.), but leaving the prior distribution of π wide like this allows for more possible responses (maybe 75% of students in one country read comic books often!).

Okay, so with our Beta(2, 6) prior, we now have all the information we need to specify the official formal model of the data-generating process for our estimand θ without any flowchart golems or null hypotheses:

often_US ~ Binomial(total_US, π_US)
often_Mexico ~ Binomial(total_Mexico, π_Mexico)
π_US ~ Beta(2, 6)
π_Mexico ~ Beta(2, 6)
θ = π_Mexico − π_US
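As a side note (this check is my addition, not part of the original post): because a Beta prior is conjugate to a binomial likelihood, this particular model has a closed-form posterior, Beta(α + successes, β + failures). That gives us a way to sanity-check the MCMC results without running Stan at all.

```r
# Posterior shape parameters: prior Beta(2, 6) updated with the observed counts
post_us  <- c(2 + 524,   6 + (5142 - 524))     # US: 524 often out of 5,142
post_mex <- c(2 + 10605, 6 + (37641 - 10605))  # Mexico: 10,605 often out of 37,641

# Posterior means, i.e. alpha / (alpha + beta)
post_us[1] / sum(post_us)    # ~0.10
post_mex[1] / sum(post_mex)  # ~0.28
```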

Basic Stan model

The neat thing about Stan is that it translates fairly directly from this mathematical model notation into code. Here we’ll define three different blocks in a Stan program:

Data that gets fed into the model, or the counts of respondents (often_us, total_us, often_mexico, and total_mexico)

Parameters to estimate, or π_US and π_Mexico

The prior and model for estimating those parameters, or the Beta priors and binomial likelihoods

props-basic.stan

```stan
// Stuff from R
data {
  int<lower=0> often_us;
  int<lower=0> total_us;
  int<lower=0> often_mexico;
  int<lower=0> total_mexico;
}

// Things to estimate
parameters {
  real<lower=0, upper=1> pi_us;
  real<lower=0, upper=1> pi_mexico;
}

// Prior and likelihood
model {
  pi_us ~ beta(2, 6);
  pi_mexico ~ beta(2, 6);
  often_us ~ binomial(total_us, pi_us);
  often_mexico ~ binomial(total_mexico, pi_mexico);
}
```

Using {cmdstanr} as our interface with Stan, we first have to compile the script into an executable file with cmdstan_model() (i.e. model_props_basic <- cmdstan_model("props-basic.stan")):

We can then feed it a list of data and run a bunch of MCMC chains:

Code

```r
often_values <- reading_counts %>%
  filter(book_type == "Comic books", frequency == "Often")

(n_often <- as.numeric(c(often_values[1, 4], often_values[2, 4])))
## [1] 10605   524

(n_respondents <- as.numeric(c(often_values[1, 5], often_values[2, 5])))
## [1] 37641  5142

props_basic_samples <- model_props_basic$sample(
  data = list(often_us = n_often[2], total_us = n_respondents[2],
              often_mexico = n_often[1], total_mexico = n_respondents[1]),
  chains = CHAINS, iter_warmup = WARMUP, iter_sampling = (ITER - WARMUP),
  seed = BAYES_SEED, refresh = 0
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.2 seconds.

props_basic_samples$print(
  variables = c("pi_us", "pi_mexico"),
  "mean", "median", "sd", ~quantile(.x, probs = c(0.025, 0.975))
)
##   variable  mean median   sd 2.5% 97.5%
##  pi_us      0.10   0.10 0.00 0.09  0.11
##  pi_mexico  0.28   0.28 0.00 0.28  0.29
```

We have the two proportions—10% and 28%—and they match what we found in the original table and in the frequentist prop.test() results (yay!). Let’s visualize these things:

Code

```r
props_basic_samples %>%
  gather_draws(pi_us, pi_mexico) %>%
  ggplot(aes(x = .value, y = .variable, fill = .variable)) +
  stat_halfeye() +
  # Multiply axis limits by 1.5% so that the right "%" isn't cut off
  scale_x_continuous(labels = label_percent(), expand = c(0, 0.015)) +
  scale_fill_manual(values = c(clr_usa, clr_mex)) +
  guides(fill = "none") +
  labs(x = "Proportion of students who read comic books often", y = NULL) +
  theme_nice()
```

This plot is neat for a couple reasons. First it shows the difference in variance across these two distributions. The sample size for Mexican respondents is huge, so the average is a lot more precise and the distribution is narrower than the American one. Second, by just eyeballing the plot we can see that there’s definitely no overlap between the two distributions, which implies that the difference (θ) between the two is definitely not zero—Mexican respondents are way more likely than Americans to read comic books often. We can find that difference by taking the pairwise difference between the two with compare_levels() from {tidybayes}, which subtracts one group’s posterior from the other:

Code

```r
basic_diffs <- props_basic_samples %>%
  gather_draws(pi_us, pi_mexico) %>%
  ungroup() %>%
  # compare_levels() subtracts things using alphabetical order, so by default it
  # would calculate pi_us - pi_mexico, but we want the opposite, so we have to
  # make pi_us the first level
  mutate(.variable = fct_relevel(.variable, "pi_us")) %>%
  compare_levels(.value, by = .variable, comparison = "pairwise")

basic_diffs %>%
  ggplot(aes(x = .value)) +
  stat_halfeye(fill = clr_diff) +
  # Multiply axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  labs(x = "Percentage point difference in proportions", y = NULL) +
  theme_nice() +
  theme(axis.text.y = element_blank(),
        panel.grid.major.y = element_blank())
```

We can also do some Bayesian inference and find the probability that that difference between the two groups is greater than 0 (or a kind of Bayesian p-value, but way more logical than null hypothesis p-values). We can calculate how many posterior draws are bigger than 0 and divide that by the number of draws to get the official proportion.

There’s a 100% chance that that difference is not zero, or a 100% chance that Mexican respondents are way more likely than their American counterparts to read comic books often.
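A sketch of that calculation in base R (using draws from the closed-form Beta posteriors rather than the Stan draws, which is my own shortcut): the Beta(2, 6) prior plus the observed counts gives each country’s posterior, and we just count how many of the differences are positive.

```r
set.seed(1234)

# 4,000 draws from each conjugate posterior: Beta(2 + successes, 6 + failures)
pi_us  <- rbeta(4000, 2 + 524,   6 + (5142 - 524))
pi_mex <- rbeta(4000, 2 + 10605, 6 + (37641 - 10605))

# Posterior distribution of the difference
theta <- pi_mex - pi_us

# Proportion of draws where Mexico's proportion exceeds the US's
mean(theta > 0)  # 1
```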

Stan model improvements

This basic Stan model is neat, but we can do a couple things to make it better:

Right now we have to feed it 4 separate numbers (counts and totals for the US and Mexico). It would be nice to just feed it a vector of counts and a vector of totals (or even a whole matrix like we did with prop.test()).

Right now we have to manually calculate the difference between the two groups (0.28 − 0.10). It would be nice to have Stan do that work for us.

We’ll tackle each of these issues in turn.

First we’ll change how the script handles the data so that it’s more dynamic. Now instead of defining explicit variables and parameters as total_us or pi_mexico or whatever, we’ll use arrays and vectors so that we can use any arbitrary number of groups if we want:

props-better.stan

```stan
// Stuff from R
data {
  int<lower=0> n;
  array[n] int<lower=0> often;
  array[n] int<lower=0> total;
}

// Things to estimate
parameters {
  vector<lower=0, upper=1>[n] pi;
}

// Prior and likelihood
model {
  pi ~ beta(2, 6);
  // We could specify separate priors like this
  // pi[1] ~ beta(2, 6);
  // pi[2] ~ beta(2, 6);
  often ~ binomial(total, pi);
}
```

Let’s make sure it works. Note how we now have to feed it an n for the number of countries and vectors of counts for often and total:

```r
props_better_samples <- model_props_better$sample(
  data = list(n = 2,
              often = c(n_often[2], n_often[1]),
              total = c(n_respondents[2], n_respondents[1])),
  chains = CHAINS, iter_warmup = WARMUP, iter_sampling = (ITER - WARMUP),
  seed = BAYES_SEED, refresh = 0
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.1 seconds.

props_better_samples$print(
  variables = c("pi[1]", "pi[2]"),
  "mean", "median", "sd", ~quantile(.x, probs = c(0.025, 0.975))
)
##  variable mean median   sd 2.5% 97.5%
##  pi[1]    0.10   0.10 0.00 0.09  0.11
##  pi[2]    0.28   0.28 0.00 0.28  0.29
```

It worked and the results are the same! The parameter names are now “pi[1]” and “pi[2]” and we’re responsible for keeping track of which subscripts correspond to which countries, which is annoying, but that’s Stan :shrug:.

Finally, we can modify the script a little more to automatically calculate θ. We’ll add a generated quantities block for that:

props-best.stan

// Stuff from R
data {
  int<lower=0> n;
  array[n] int<lower=0> often;
  array[n] int<lower=0> total;
}

// Things to estimate
parameters {
  vector<lower=0, upper=1>[n] pi;
}

// Prior and likelihood
model {
  pi ~ beta(2, 6);
  often ~ binomial(total, pi);
}

// Stuff Stan will calculate before sending back to R
generated quantities {
  real theta;
  theta = pi[2] - pi[1];
}

Working with raw Stan code like that is fun and exciting—understanding the inner workings of these models is really neat and important! But in practice, I rarely use raw Stan. It takes too much pre- and post-processing for my taste (the data has to be fed in a list instead of a nice rectangular data frame; the variable names get lost unless you do some extra programming work; etc.).

Instead, I use {brms} for pretty much all my Bayesian models. It uses R’s familiar formula syntax, it works with regular data frames, it maintains variable names, and it’s just an all-around super-nice-and-polished frontend for working with Stan.

With a little formula finagling, we can create the same beta binomial model we built with raw Stan using {brms}.

Counts and trials as formula outcomes

In R’s standard formula syntax, you put the outcome on the left-hand side of a ~ and the explanatory variables on the right-hand side (e.g. outcome ~ x1 + x2).

You typically feed the model function a data frame with columns for each of the variables in the formula. One neat and underappreciated feature of glm() is that you can instead feed it aggregated count data (rather than long data with one row per observation) by specifying the number of successes and the number of failures as the outcome part of the formula. This runs something called aggregated logistic regression or aggregated binomial regression.

glm(cbind(n_successes, n_failures) ~ x, data = whatever, family = binomial)
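To make the mechanics concrete, here's a minimal runnable sketch with made-up counts (the toy data frame and its numbers are hypothetical, not the PISA data). Because this little model is saturated—one parameter per group—the fitted log odds recover the observed group proportions exactly:

```r
# Hypothetical aggregated data: two groups with different success rates
toy <- data.frame(
  x = c("a", "b"),
  n_successes = c(30, 10),
  n_failures  = c(70, 90)
)

# Aggregated binomial regression: successes and failures jointly form the outcome
model <- glm(cbind(n_successes, n_failures) ~ x, data = toy, family = binomial)

coef(model)
# The intercept is the log odds of success in group "a": log(30/70)
# The slope is the difference in log odds between groups: log(10/90) - log(30/70)
```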

{brms} uses slightly different syntax for aggregated logistic regression. Instead of the number of failures, it needs the total number of trials, and instead of cbind(...), it uses n | trials(total) as the outcome.
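As a sketch (not runnable as-is—whatever stands in for a data frame with columns n, total, and an explanatory variable x), the {brms} version of the glm() call above looks like this:

```r
# n = number of successes, total = number of trials for each row
brm(n | trials(total) ~ x, data = whatever, family = binomial)
```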

Our comic book data is already in this count form, and we have columns for the number of “successes” (number of respondents reading comic books often) and the total number of “trials” (number of respondents reading comic books):

Code

often_comics_only <- reading_counts %>%
  filter(book_type == "Comic books", frequency == "Often")
often_comics_only
## # A tibble: 2 × 6
##   country       book_type   frequency     n total  prop
##   <fct>         <chr>       <fct>     <int> <int> <dbl>
## 1 Mexico        Comic books Often     10605 37641 0.282
## 2 United States Comic books Often       524  5142 0.102

Binomial model with logistic link

Since we have columns for n, total, and country, we can run an aggregated binomial logistic regression model like this:

Code

often_comics_model_logit <- brm(
  bf(n | trials(total) ~ country),
  data = often_comics_only,
  family = binomial(link = "logit"),
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.1 seconds.

Code

often_comics_model_logit
##  Family: binomial 
##   Links: mu = logit 
## Formula: n | trials(total) ~ country 
##    Data: often_comics_only (Number of observations: 2) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              -0.94      0.01    -0.96    -0.91 1.00     4706     3051
## countryUnitedStates    -1.24      0.05    -1.34    -1.15 1.00      981     1123
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Because we used a logit link for the binomial family, these results are on the log odds scale, which, ew. We can’t really interpret them directly unless we do some extra math with plogis() (see here for more about how to do that). The logistic-ness of the results is also apparent in the formal mathy model for this approach, which no longer uses a Beta distribution for estimating π.
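For instance, plugging the rounded coefficients from the summary above into plogis() recovers the two proportions by hand—these values are approximate because the summary rounds to two decimals:

```r
# Intercept = log odds for Mexico (the reference category)
plogis(-0.94)
# ≈ 0.28, the proportion of Mexican respondents who read comics often

# Intercept + US coefficient = log odds for the United States
plogis(-0.94 + -1.24)
# ≈ 0.10, the US proportion
```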

We can still work with percentage point values if we use epred_draws() and a bit of data wrangling, since that automatically back-transforms from log odds to counts (see here for an explanation of how and why). We can convert these posterior counts to a proportion again by dividing each predicted count by the total for each row.

Code

# brms keeps all the original factor/category names, so there's no need for
# extra manual work here!
draws_logit <- often_comics_model_logit %>%
  # This gives us counts...
  epred_draws(newdata = often_comics_only) %>%
  # ...so divide by the original total to get proportions again
  mutate(.epred_prop = .epred / total)

p1 <- draws_logit %>%
  ggplot(aes(x = .epred_prop, y = country, fill = country)) +
  stat_halfeye() +
  scale_x_continuous(labels = label_percent(), expand = c(0, 0.015)) +
  scale_fill_manual(values = c(clr_usa, clr_mex)) +
  guides(fill = "none") +
  labs(x = "Proportion of students who read comic books often", y = NULL) +
  theme_nice()

p2 <- draws_logit %>%
  ungroup() %>%
  # compare_levels() subtracts things using alphabetical order, so we have to
  # make the United States the first level
  mutate(country = fct_relevel(country, "United States")) %>%
  compare_levels(.epred_prop, by = "country") %>%
  ggplot(aes(x = .epred_prop)) +
  stat_halfeye(fill = clr_diff) +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  labs(x = "Percentage point difference in proportions", y = NULL) +
  # Make it so the pointrange doesn't get cropped
  coord_cartesian(clip = "off") +
  theme_nice() +
  theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())

(p1 / plot_spacer() / p2) +
  plot_layout(heights = c(0.785, 0.03, 0.185))

Code

draws_logit %>%
  group_by(country) %>%
  median_qi(.epred_prop)
## # A tibble: 2 × 7
##   country       .epred_prop .lower .upper .width .point .interval
##   <fct>               <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1 Mexico              0.282 0.277  0.286    0.95 median qi       
## 2 United States       0.102 0.0936 0.110    0.95 median qi

Cool cool, all the results are the same as using raw Stan.

Binomial model with identity link

We didn’t set any priors here, and if we want to be good Bayesians, we should. However, given the logit link, we’d need to specify priors on the log odds scale, and I can’t naturally think in logits. I can think about percentages though, which is why I like the Beta distribution for priors for proportions—it just makes sense.

Also, the raw Stan models spat out percentage-point scale parameters—it’d be neat if {brms} could too.

And it can! We just have to change the link function for the binomial family from "logit" to "identity". This isn’t really documented anywhere (I don’t think?), and it feels weird and wrong, but it works. Note how the “logit” drops out of the second line of the formal model—we’re no longer using a link function.

Doing this works, but there are some issues. The identity link in a binomial model means that the model parameters won’t be transformed to the logit scale and will instead stay on the proportion scale. We’ll get some errors related to MCMC values because the outcome needs to be constrained between 0 and 1, and the MCMC chains will occasionally wander down into negative numbers and make Stan mad. The model will mostly fit if we specify initial MCMC values at 0.1 or something, but it’ll still complain.

Code

often_comics_model_identity <- brm(
  bf(n | trials(total) ~ country),
  data = often_comics_only,
  family = binomial(link = "identity"),
  init = 0.1,
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## Chain 3 Rejecting initial value:
## Chain 3 Error evaluating the log probability at the initial value.
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0175658, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0175658, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Rejecting initial value:
## Chain 3 Error evaluating the log probability at the initial value.
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0471304, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0471304, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Rejecting initial value:
## Chain 3 Error evaluating the log probability at the initial value.
## Chain 3 Exception: binomial_lpmf: Probability parameter[2] is -0.03896, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Exception: binomial_lpmf: Probability parameter[2] is -0.03896, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Rejecting initial value:
## Chain 3 Error evaluating the log probability at the initial value.
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0529492, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 3 Exception: binomial_lpmf: Probability parameter[1] is -0.0529492, but must be in the interval [0, 1] (in '/var/folders/17/g3pw3lvj2h30gwm67tbtx98c0000gn/T/RtmpQKMbv4/model-d1884317c1cb.stan', line 36, column 4 to column 44)
## Chain 2 finished in 0.0 seconds.
## Chain 1 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.3 seconds.

Code

often_comics_model_identity
##  Family: binomial 
##   Links: mu = identity 
## Formula: n | trials(total) ~ country 
##    Data: often_comics_only (Number of observations: 2) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept               0.28      0.00     0.28     0.29 1.00     3627     2558
## countryUnitedStates    -0.18      0.00    -0.19    -0.17 1.00     1472     1907
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

One really nice thing about this identity-link model is that the coefficient for countryUnitedStates shows us the percentage-point-scale difference in proportions: −0.18! This is just a regular regression model, so the intercept shows us the average proportion when United States is false (i.e. for Mexico), and the United States coefficient shows the offset from the intercept.

Working with the countryUnitedStates coefficient directly is convenient—there’s no need to divide predicted values by totals or use compare_levels() to find the difference between the United States and Mexico. We have our estimand immediately.

Code

draws_diffs_identity <- often_comics_model_identity %>%
  gather_draws(b_countryUnitedStates) %>%
  # Reverse the value since our theta is Mexico - US, not US - Mexico
  mutate(.value = -.value)

draws_diffs_identity %>%
  ggplot(aes(x = .value)) +
  stat_halfeye(fill = clr_diff) +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  labs(x = "Percentage point difference in proportions", y = NULL) +
  theme_nice() +
  theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())

However, I’m still not entirely happy with it. For one thing, I don’t like all the initial MCMC errors. The model still eventually fit, but I’d prefer it to have a less rocky start. I could probably tinker with more options to get it working, but that’s a hassle.

More important, though, is the issue of priors. We still haven’t set any. We’re also no longer using a beta-binomial model—this is just a regular old binomial regression, which means we’re working with intercepts and slopes. If we use the n | trials(total) ~ country model with an identity link, we need to set priors for the intercept and for the difference, which means thinking about two kinds of values: (1) the prior average percentage for Mexico and (2) the prior average difference between Mexico and the United States. In the earlier raw Stan model, we set priors for the average percentages for each country and never had to think about the difference. Conceptually, I think that is easier. In my own work, I can think about the prior distributions for specific survey response categories (30% might agree, 50% might disagree, 20% might be neutral), but thinking about differences is less natural and straightforward (there might be a 20 percentage point difference between agree and disagree? that feels weird).

To get percentages for each country, avoid the odd initial value errors, and set more natural priors, we can fit an intercept-free model by including a 0 on the right-hand side of the formula. This removes the Mexico reference category and returns estimates for both Mexico and the United States—and it finally lets us set a prior. Here, as with the Stan model earlier, I use Beta(2, 6) for both countries, but the priors could easily differ by country too. This is one way to force {brms} to essentially use a beta-binomial model, and it looks like this:

Code

often_comics_model <- brm(
  bf(n | trials(total) ~ 0 + country),
  data = often_comics_only,
  family = binomial(link = "identity"),
  prior = c(prior(beta(2, 6), class = b, lb = 0, ub = 1)),
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.1 seconds.

Code

often_comics_model
##  Family: binomial 
##   Links: mu = identity 
## Formula: n | trials(total) ~ 0 + country 
##    Data: often_comics_only (Number of observations: 2) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## countryMexico           0.28      0.00     0.28     0.29 1.00     3581     2914
## countryUnitedStates     0.10      0.00     0.09     0.11 1.00     2824     2470
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The coefficients here represent the average proportions for each country. The main estimand we care about is still the difference between the two, so we need to do a little bit of data manipulation to calculate that, just like we did with the first logit version of the model:

Code

p1 <- often_comics_model %>%
  epred_draws(newdata = often_comics_only) %>%
  mutate(.epred_prop = .epred / total) %>%
  ggplot(aes(x = .epred_prop, y = country, fill = country)) +
  stat_halfeye() +
  scale_x_continuous(labels = label_percent(), expand = c(0, 0.015)) +
  scale_fill_manual(values = c(clr_usa, clr_mex)) +
  guides(fill = "none") +
  labs(x = "Proportion of students who read comic books often", y = NULL) +
  theme_nice()

p2 <- often_comics_model %>%
  epred_draws(newdata = often_comics_only) %>%
  mutate(.epred_prop = .epred / total) %>%
  ungroup() %>%
  # compare_levels() subtracts things using alphabetical order, so we have to
  # make the United States the first level
  mutate(country = fct_relevel(country, "United States")) %>%
  compare_levels(.epred_prop, by = country, comparison = "pairwise") %>%
  ggplot(aes(x = .epred_prop)) +
  stat_halfeye(fill = clr_diff) +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  labs(x = "Percentage point difference in proportions", y = NULL) +
  # Make it so the pointrange doesn't get cropped
  coord_cartesian(clip = "off") +
  theme_nice() +
  theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())

(p1 / plot_spacer() / p2) +
  plot_layout(heights = c(0.785, 0.03, 0.185))

Actual beta-binomial model

Until {brms} 2.17, there wasn’t an official beta-binomial family in {brms}—it was instead used as the example for creating your own custom family. It is now implemented directly and lets you define both a mean (μ) and a precision (φ) for the Beta distribution, just like {brms}’s other Beta-related models (zero-inflated Beta, etc.—see here for a lot more about those). This means we can model both parts of the distribution simultaneously, which is neat, since it lets us deal with potential overdispersion in the outcome. Paul Bürkner’s original rationale for not including it was that a regular binomial model with a random effect for the observation id also accounts for overdispersion, so there wasn’t really a need for an official beta-binomial family. But in March 2022 the beta_binomial family was added as an official distributional family, which is neat.

We can use it here instead of family = binomial(link = "identity") with a few adjustments. The family uses a mean/precision parameterization of the Beta distribution instead of the two shapes α and β, but we can switch between them with some algebra (see this for more details): α = μφ and β = (1 − μ)φ, or going the other way, μ = α / (α + β) and φ = α + β.
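Here's that conversion as a quick R sketch (the helper function names here are mine for illustration, not part of {brms}). The Beta(2, 6) prior used throughout this post corresponds to μ = 0.25 and φ = 8:

```r
# Shape parameters -> mean/precision (hypothetical helper, not a brms function)
shapes_to_mu_phi <- function(alpha, beta) {
  c(mu = alpha / (alpha + beta), phi = alpha + beta)
}

# Mean/precision -> shape parameters
mu_phi_to_shapes <- function(mu, phi) {
  c(alpha = mu * phi, beta = (1 - mu) * phi)
}

shapes_to_mu_phi(2, 6)
# mu = 0.25, phi = 8

mu_phi_to_shapes(0.25, 8)
# alpha = 2, beta = 6
```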

By default, μ is modeled on the log odds scale and φ is modeled on the log scale, but I find both of those really hard to think about, so we can use an identity link for both parameters, like we did before with binomial(), and think in counts and proportions instead. This makes it so the φ parameter measures the precision of the count on the count scale, so a prior like Exponential(1/1000) implies that the precision (or variance-ish) of the count will mostly be low, but could range up to ±5000ish, which seems reasonable, especially since the Mexico part of the survey has so many respondents:

often_comics_model_beta_binomial <- brm(
  bf(n | trials(total) ~ 0 + country),
  data = often_comics_only,
  family = beta_binomial(link = "identity", link_phi = "identity"),
  prior = c(
    prior(beta(2, 6), class = "b", dpar = "mu", lb = 0, ub = 1),
    prior(exponential(0.001), class = "phi", lb = 0)
  ),
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.1 seconds.
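As a quick sanity check on that Exponential(1/1000) intuition (just base R, no modeling): the distribution's mean is 1/rate = 1000 and its 99th percentile is about 4605, so the prior puts most of its mass on low-ish precisions while still allowing values up to ~5000:

```r
# Mean of an exponential distribution is 1 / rate
1 / 0.001
# 1000

# 99% of the prior mass for phi sits below ~4605
qexp(0.99, rate = 0.001)
# ≈ 4605.17
```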

Separate model for φ

If we wanted to be super fancy, we could define a completely separate model for the φ part of the distribution like this, but we don’t need to here:

often_comics_model_beta_binomial <- brm(
  bf(n | trials(total) ~ 0 + country,
     phi ~ 0 + country),
  data = often_comics_only,
  family = beta_binomial(link = "identity", link_phi = "identity"),
  prior = c(
    prior(beta(2, 6), class = "b", dpar = "mu", lb = 0, ub = 1),
    prior(exponential(0.001), dpar = "phi", lb = 0)
  ),
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)

Check out the results:

Code

often_comics_model_beta_binomial
##  Family: beta_binomial 
##   Links: mu = identity; phi = identity 
## Formula: n | trials(total) ~ 0 + country 
##    Data: often_comics_only (Number of observations: 2) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## countryMexico           0.28      0.03     0.22     0.33 1.00     1786      890
## countryUnitedStates     0.11      0.02     0.08     0.16 1.00     1626     1085
## 
## Family Specific Parameters: 
##     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi  1031.27   1022.22    38.29  3841.42 1.01      751      958
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Those coefficients are the group proportions, as expected, and we have a φ parameter representing the overall variation in counts. The proportions here are a little more uncertain than before, though, which is apparent if we plot the distributions. The distributions have a much wider range now (note that the x-axis now goes all the way up to 60%), and the densities are a lot bumpier and jankier. I don’t know why though! This is weird! I’m probably doing something wrong!

Phew, that was a lot of slow pedagogical exposition. What’s our official estimand? What’s our final answer to the question “Do respondents in Mexico read comic books more often than respondents in the United States?”?

Yes. They most definitely do.

In an official sort of report or article, I’d write something like this:

Students in Mexico are far more likely to read comic books often than students in the United States. On average, 28.2% (between 27.7% and 28.6%) of PISA respondents in Mexico read comic books often, compared to 10.2% (between 9.4% and 11.1%) in the United States. There is a 95% posterior probability that the difference between these proportions is between 17.0 and 18.9 percentage points, with a median of 18.0 percentage points. This difference is substantial, and there’s a 100% chance that the difference is not zero.

Question 2: Frequency of newspaper readership in the US

Now that we’ve got the gist of proportion tests with {brms}, we’ll go a lot faster for this second question. We’ll forgo all frequentist stuff and the raw Stan stuff and just skip straight to the intercept-free binomial model with an identity link.

Estimand

For this question, we want to know the differences in the proportions of American newspaper-reading frequencies—whether the differences between (1) rarely and sometimes, (2) sometimes and often, and (3) rarely and often are greater than zero.

We’ll again call this estimand θ, but with three different versions of it: θ₁ = π_sometimes − π_rarely, θ₂ = π_often − π_sometimes, and θ₃ = π_often − π_rarely.

We just spent a bunch of time talking about comic books, and now we’re looking at data about newspapers and America. Who represents all three of these things simultaneously? Clark Kent / Superman, obviously, the Daily Planet journalist and superpowered alien dedicated to truth, justice, and a better tomorrow (formerly “the American way”). I found this palette at Adobe Color.

Code

# US newspaper question colors
clr_often <- "#009DDC"
clr_sometimes <- "#FFDE00"
clr_rarely <- "#E51B24"

Let’s first extract the aggregated data we’ll work with—newspaper frequency in the United States only:

Code

newspapers_only <- reading_counts %>%
  filter(book_type == "Newspapers", country == "United States")
newspapers_only
## # A tibble: 3 × 6
##   country       book_type  frequency     n total  prop
##   <fct>         <chr>      <fct>     <int> <int> <dbl>
## 1 United States Newspapers Rarely     1307  5172 0.253
## 2 United States Newspapers Sometimes  1925  5172 0.372
## 3 United States Newspapers Often      1940  5172 0.375

We’ll define this formal beta-binomial-style model for each of the group proportions, using a Beta(2, 6) prior again (so 25% ± a bunch).

We can estimate this model with {brms} using an intercept-free binomial model with an identity link:

Code

freq_newspapers_model <- brm(
  bf(n | trials(total) ~ 0 + frequency),
  data = newspapers_only,
  family = binomial(link = "identity"),
  prior = c(prior(beta(2, 6), class = b, lb = 0, ub = 1)),
  chains = CHAINS, warmup = WARMUP, iter = ITER,
  seed = BAYES_SEED, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.1 seconds.

Code

freq_newspapers_model
##  Family: binomial 
##   Links: mu = identity 
## Formula: n | trials(total) ~ 0 + frequency 
##    Data: newspapers_only (Number of observations: 3) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## frequencyRarely        0.25      0.01     0.24     0.26 1.00     4536     2758
## frequencySometimes     0.37      0.01     0.36     0.38 1.00     4084     2933
## frequencyOften         0.37      0.01     0.36     0.39 1.00     4318     3217
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

It works! These group proportions are the same as what we found in the contingency table:

Code

plot_props_newspaper <- freq_newspapers_model %>%
  epred_draws(newdata = newspapers_only) %>%
  mutate(.epred_prop = .epred / total) %>%
  ggplot(aes(x = .epred_prop, y = frequency, fill = frequency)) +
  stat_halfeye() +
  scale_x_continuous(labels = label_percent()) +
  scale_fill_manual(values = c(clr_rarely, clr_sometimes, clr_often)) +
  labs(x = "Proportion of frequencies of newspaper reading", y = NULL) +
  guides(fill = "none") +
  theme_nice()
plot_props_newspaper

We’re interested in our three θs, or the posterior differences between each of these proportions. We can again use compare_levels() to find these all at once. If we specify comparison = "pairwise", {tidybayes} will calculate the differences between each pair of proportions.

Code

freq_newspapers_diffs <- freq_newspapers_model %>%
  epred_draws(newdata = newspapers_only) %>%
  mutate(.epred_prop = .epred / total) %>%
  ungroup() %>%
  compare_levels(.epred_prop, by = frequency) %>%
  # Put these in the right order
  mutate(frequency = factor(frequency, levels = c(
    "Often - Sometimes", "Sometimes - Rarely", "Often - Rarely"
  )))

freq_newspapers_diffs %>%
  ggplot(aes(x = .epred_prop, y = fct_rev(frequency), fill = frequency)) +
  stat_halfeye(fill = clr_diff) +
  geom_vline(xintercept = 0, color = "#F012BE") +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  labs(x = "Percentage point difference in proportions", y = NULL) +
  guides(fill = "none") +
  theme_nice()

We can also calculate the official median differences and the probabilities that the posteriors are greater than 0:

Code

freq_newspapers_diffs %>%
  summarize(
    median = median_qi(.epred_prop, .width = 0.95),
    p_gt_0 = sum(.epred_prop > 0) / n()
  ) %>%
  unnest(median)
## # A tibble: 3 × 8
##   frequency                y    ymin   ymax .width .point .interval p_gt_0
##   <fct>                <dbl>   <dbl>  <dbl>  <dbl> <chr>  <chr>      <dbl>
## 1 Often - Sometimes  0.00268 -0.0161 0.0214   0.95 median qi          0.615
## 2 Sometimes - Rarely 0.119    0.101  0.136    0.95 median qi          1    
## 3 Often - Rarely     0.122    0.104  0.140    0.95 median qi          1

There’s only a 60ish% chance that the difference between often and sometimes is bigger than zero, so there’s probably not an actual difference between those two categories, but there’s a 100% chance that the differences between sometimes and rarely and often and rarely are bigger than zero.

Better fill colors

Before writing up the final official answer, we need to tweak the plot of differences. With the comic book question, we used some overly-simplified-and-wrong color theory and created a yellow color for the difference (since blue − green = yellow). We could maybe do that here too, but we’ve actually used all primary colors in our Superman palette. I don’t know what blue − yellow or yellow − red would be, and even if I calculated it somehow, it wouldn’t be as cutely intuitive as blue minus green.

So instead, we’ll do some fancy fill work with the neat {ggpattern} package, which lets us fill ggplot geoms with multicolored patterns. We’ll fill each distribution of θs with the combination of its two colors: the difference between often and sometimes gets stripes of those two colors, and so on.

We can’t use geom/stat_halfeye() because {tidybayes} does fancier geom work when plotting its density slabs, but we can use geom_density_pattern() to create normal density plots:

Code

freq_newspapers_diffs %>%
  ggplot(aes(x = .epred_prop, fill = frequency, pattern_fill = frequency)) +
  geom_density_pattern(
    pattern = "stripe",     # Stripes
    pattern_density = 0.5,  # Take up 50% of the pattern (i.e. stripes equally sized)
    pattern_spacing = 0.2,  # Thicker stripes
    pattern_size = 0,       # No border on the stripes
    trim = TRUE,            # Trim the ends of the distributions
    linewidth = 0           # No border on the distributions
  ) +
  geom_vline(xintercept = 0, color = "#F012BE") +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  # Set colors for fills and pattern fills
  scale_fill_manual(values = c(clr_often, clr_sometimes, clr_often)) +
  scale_pattern_fill_manual(values = c(clr_sometimes, clr_rarely, clr_rarely)) +
  guides(fill = "none", pattern_fill = "none") +  # Turn off legends
  labs(x = "Percentage point difference in proportions", y = NULL) +
  facet_wrap(vars(frequency), ncol = 1) +
  theme_nice() +
  theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())

Ahhh that’s so cool!

The only thing that’s missing is the pointrange that we get from stat_halfeye() that shows the median and the 50%, 80%, and 95% credible intervals. We can calculate those ourselves and add them with geom_pointinterval():

Code

# Find medians and credible intervals
freq_newspapers_intervals <- freq_newspapers_diffs %>%
  group_by(frequency) %>%
  median_qi(.epred_prop, .width = c(0.5, 0.8, 0.95))

plot_diffs_nice <- freq_newspapers_diffs %>%
  ggplot(aes(x = .epred_prop, fill = frequency, pattern_fill = frequency)) +
  geom_density_pattern(
    pattern = "stripe",     # Stripes
    pattern_density = 0.5,  # Take up 50% of the pattern (i.e. stripes equally sized)
    pattern_spacing = 0.2,  # Thicker stripes
    pattern_size = 0,       # No border on the stripes
    trim = TRUE,            # Trim the ends of the distributions
    linewidth = 0           # No border on the distributions
  ) +
  # Add 50%, 80%, and 95% intervals + median
  geom_pointinterval(
    data = freq_newspapers_intervals,
    aes(x = .epred_prop, xmin = .lower, xmax = .upper)
  ) +
  geom_vline(xintercept = 0, color = "#F012BE") +
  # Expand axis limits by 0.5% so that the right "pp." isn't cut off
  scale_x_continuous(labels = label_pp, expand = c(0, 0.005)) +
  # Set colors for fills and pattern fills
  scale_fill_manual(values = c(clr_often, clr_sometimes, clr_often)) +
  scale_pattern_fill_manual(values = c(clr_sometimes, clr_rarely, clr_rarely)) +
  guides(fill = "none", pattern_fill = "none") +  # Turn off legends
  labs(x = "Percentage point difference in proportions", y = NULL) +
  facet_wrap(vars(frequency), ncol = 1) +
  # Make it so the pointrange doesn't get cropped
  coord_cartesian(clip = "off") +
  theme_nice() +
  theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())
plot_diffs_nice

Perfect!

Final answer to the question

So, given these results, what’s the answer to the question “Do students in the United States vary in their frequency of reading newspapers?” What are our three θs? The frequencies vary, but only between rarely and the other two categories—there’s no difference between sometimes and often.

Code

((plot_props_newspaper +
    labs(x = "Proportions") +
    facet_wrap(vars("Response proportions"))) |
   plot_spacer() |
   (plot_diffs_nice +
      labs(x = "Percentage point differences"))) +
  plot_layout(widths = c(0.475, 0.05, 0.475))

Here’s how I’d write about the results:

Students in the United States tend to read the newspaper at least sometimes, and are least likely to read it rarely. On average, 25.3% of American PISA respondents report reading the newspaper rarely (with a 95% credible interval of between 24.1% and 26.5%), compared to 37.2% reading sometimes (35.9%–38.5%) and 37.5% reading often (36.2%–38.8%).

There is no substantial difference in proportions between those reporting reading newspapers often and sometimes. The posterior difference ranges between −1.6 and 2.1 percentage points, with a median of 0.3 percentage points, and there’s only a 61.5% probability that this difference is greater than 0, which implies that the two categories are indistinguishable from each other.

There is a clear, substantial difference between the proportion reading newspapers rarely and the other two responses, though. The posterior difference between sometimes and rarely ranges between 10.1 and 13.6 percentage points (median = 11.9), while the difference between often and rarely ranges between 10.4 and 14.0 percentage points (median = 12.2). The probability that each of these differences is greater than 0 is 100%.

Et voilà! Principled, easily interpretable, non-golem-based tests of differences in proportions using Bayesian statistics!

Making Middle Earth maps with R
Andrew Heiss
https://www.andrewheiss.com/blog/2023/04/26/middle-earth-mapping-sf-r-gis/index.html
I’ve taught a course on data visualization with R since 2017, and it’s become one of my more popular classes, especially since it’s all available asynchronously online with hours of Creative Commons-licensed videos and materials. One of the most popular sections of the class (as measured by my server logs and by how often I use it myself) is a section on GIS-related visualization, or how to work with maps in {ggplot2}. Nowadays, since the advent of the {sf} package, I find that making maps with R is incredibly easy and fun.

I’m also a huge fan of J. R. R. Tolkien and his entire Legendarium (as evidenced by my previous blog post here simulating Aragorn’s human-scale age based on an obscure footnote in Tolkien’s writings about Númenor).

Back in 2020, as I was polishing up my data visualization course page on visualizing spatial data, I stumbled across a set of shapefiles for Middle Earth, meaning that it was possible to use R and ggplot to make maps of Tolkien’s fictional world. I whipped up a quick example and tweeted about it back then, but then kind of forgot about it.

With Twitter dying, and with my recent read of The Fall of Númenor, Middle Earth maps have been on my mind again, so I figured I’d make a more formal didactic blog post about how to make and play with these maps. So consider this blog post a fun little playground for learning more about doing GIS work with {sf} and ggplot, and learn some neat data visualization tricks along the way.

Thanks to the magic of the {sf} package (“sf” = “simple features”), working with geographic (or GIS) data in R is really really straightforward and fun.

Geographic data is a lot more complex than regular tabular spreadsheet-like data, since it includes information about points (latitude, longitude), paths (a bunch of connected latitudes and longitudes), and areas (a bunch of connected latitudes and longitudes that form a complete shape). Additionally, it has to keep track of units and distances and map projections (or methods for flattening parts of a round globe onto a two-dimensional surface). This kind of data is often stored in shapefile format (though there are alternatives like GeoJSON), which consists of multiple files. For instance, here’s what the 2022 US Census’s shapefile for US states looks like when unzipped:

It has 7 different files, each with different purposes! Fortunately, it’s easy to read all these in with {sf}. Feed the name of the main .shp file to read_sf() and it’ll handle all the other secondary files (like .dbf and .shx and .prj).
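For instance, loading that shapefile is a one-liner. A minimal sketch, assuming the Census data was unzipped into a data/ folder (the path is illustrative):

```r
library(sf)

# read_sf() only needs the path to the main .shp file; it automatically
# picks up the sibling .dbf, .shx, .prj, etc. files in the same folder
us_states <- read_sf("data/cb_2022_us_state_20m/cb_2022_us_state_20m.shp")
```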

It looks like a regular R dataframe, and it is (all the regular dplyr functions work on it), but it has one added part—there’s a special list column at the end named geometry that contains the actual geographic data, and the dataframe has special metadata with details about the map projection. As seen above, the data uses NAD 83. We can change that to any projection we want, though, with st_transform(). To make life a little easier when calculating distances and combining maps later in this post, we’ll convert this US map to the WGS 84 projection, which is what Google Maps (and all GPS systems) use:
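That conversion is a quick pipeline with st_transform(). A sketch, assuming us_states holds the Census data loaded above (the filtering to the contiguous states matches what the post does elsewhere):

```r
library(dplyr)

# Drop the non-contiguous states/territories and
# re-project from NAD 83 to WGS 84
lower_48 <- us_states %>%
  filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))) %>%
  st_transform(st_crs("EPSG:4326"))  # WGS 84
```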

{sf} makes it incredibly easy to plot maps too. By relying on the geographic details embedded in the special geometry column, the geom_sf() function automatically plots the correct kind of data (points, lines, or areas). And since we’re just working with a dataframe, everything in the grammar of graphics paradigm works too. We can map columns to specific aesthetics. For instance, the Census shapefile happened to come with a column named ALAND that measures the total land area in each state in square meters. We can fill by that column and create a choropleth map showing states by size:

Since this is just a regular ggplot geom, all other ggplot things work, like modifying themes, scales, etc. We can also change the projection on-the-fly with coord_sf() (here I use Albers):
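A sketch of both ideas together, filling by ALAND and re-projecting on the fly (the viridis fill scale is just one reasonable choice, not necessarily what the original plot used):

```r
library(ggplot2)

ggplot() +
  geom_sf(data = lower_48, aes(fill = ALAND), color = "white", linewidth = 0.25) +
  scale_fill_viridis_c(name = "Land area (m²)") +
  coord_sf(crs = st_crs("ESRI:102003")) +  # Re-project to US Albers on the fly
  theme_void()
```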

Shapefiles are everywhere. They’re one of the de facto standard formats for GIS data, and most government agencies provide them for their jurisdictions (see here for a list of some different sources). You can view and edit them graphically with the free and open source QGIS or with the expensive and industry-standard ArcGIS.

We’ve already seen how to load shapefiles into R with sf::read_sf(), and that works great. But doing that requires that you go and find and download the shapefiles that you want, which can involve hunting through complicated websites. There are also lots of different R packages that let you get shapefiles directly from different websites’ APIs.

For example, we’ve already loaded the 2022 US Census maps by downloading and unzipping the shapefile and using read_sf(). We could have also used the {tigris} package to access the data directly from the Census, like this:

Code

library(tigris)

us_states <- states(resolution = "20m", year = 2022, cb = TRUE)

lower_48 <- us_states %>%
  filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico")))

For world-level data, Natural Earth has incredibly well-made shapefiles. We could download the 1:50m cultural data from their website, unzip it, and load it with read_sf():

Code

# Medium scale data, 1:50m Admin 0 - Countries
# Download from https://www.naturalearthdata.com/downloads/50m-cultural-vectors/
world_map <- read_sf("ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp") %>%
  filter(iso_a3 != "ATA")  # Remove Antarctica

Or we can use the {rnaturalearth} package to do the same thing:

Code

library(rnaturalearth)

# returnclass = "sf" makes it so the resulting dataframe has the special
# sf-enabled geometry column
world_map <- ne_countries(scale = 50, returnclass = "sf") %>%
  filter(iso_a3 != "ATA")  # Remove Antarctica

Important

Throughout this post, I use {rnaturalearth} for world-level shapefiles and downloaded shapefiles for the US, but that’s just for the sake of illustration. Either approach works for both: you can fetch the data through a package or download the shapefiles yourself.

And finally, for fun, here are some examples of different maps and projections and ggplot tinkering. I’m perpetually astounded by how easy it is to plot GIS data with geom_sf()! That geometry list column is truly magical.

Code

library(patchwork)

p1 <- ggplot() +
  geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) +
  coord_sf(crs = st_crs("EPSG:4269")) +  # NAD83
  labs(title = "NAD83 projection") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))

p2 <- ggplot() +
  geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) +
  coord_sf(crs = st_crs("ESRI:102003")) +  # Albers
  labs(title = "Albers projection") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))

p3 <- ggplot() +
  geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) +
  coord_sf(crs = st_crs("EPSG:3395")) +  # Mercator
  labs(title = "Mercator projection") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))

p4 <- ggplot() +
  geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) +
  coord_sf(crs = st_crs("ESRI:54030")) +  # Robinson
  labs(title = "Robinson projection") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))

(p1 | p2) / (p3 | p4)

Quick reminder: latitude vs. longitude

One last little GIS-related thing before going to Middle Earth. Geographic data doesn’t rely on the standard X/Y Cartesian plane. Instead, it uses latitudes and longitudes. I’ve loved maps and globes all my life, but I can never remember how latitudes and longitudes translate to X and Y, especially since coordinates are often reported as lat, lon, which is technically the reverse of x, y.

I have this graph printed and hanging on my office wall next to my computer and refer to it all the time. It’s my gift to all of you.

Code

point_example <- tibble(x = 2, y = 1) %>%
  mutate(label = glue::glue("{x} x, {y} y\n{y} lat, {x} lon"))

lat_labs <- tibble(x = -3, y = seq(-2, 3, 1), label = "Latitude")
lon_labs <- tibble(x = seq(-2, 3, 1), y = -2, label = "Longitude")

ggplot() +
  geom_point(data = point_example, aes(x = x, y = y), size = 5) +
  geom_label(
    data = point_example, aes(x = x, y = y, label = label),
    nudge_y = 0.6, family = "Overpass ExtraBold"
  ) +
  geom_text(
    data = lat_labs, aes(x = x, y = y, label = label),
    hjust = 0.5, vjust = -0.3, family = "Overpass Light"
  ) +
  geom_text(
    data = lon_labs, aes(x = x, y = y, label = label),
    hjust = 1.1, vjust = -0.5, angle = 90, family = "Overpass Light"
  ) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(breaks = seq(-2, 3, 1)) +
  scale_y_continuous(breaks = seq(-2, 3, 1)) +
  coord_equal(xlim = c(-3.5, 3), ylim = c(-3, 3)) +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(), axis.text = element_blank())

Loading Middle Earth shapefiles

Phew, okay. With that quick overview done, we can start playing with the ME-GIS data. There isn’t a pre-built R package for the data, so we’ll need to download the GitHub repository ourselves. I put all the files in a folder named data/ME-GIS relative to this document. I also downloaded the 2022 US Census cartographic boundary files and put them in a folder named data/cb_2022_us_state_20m. If you’re following along, I suggest you do the same.

The ME-GIS project includes tons of different shapefile layers: data for coastline borders, elevation contours, forest boundaries, city locations, and so on. We’ll load a bunch of them here.

Notice the extra iconv(., from = "ISO-8859-1", to = "UTF-8") that I’ve added to each read_sf() call. This is necessary because the original data isn’t stored as Unicode, which is an issue because Tolkien used all sorts of accents (like “Lórien”), and R can choke on these characters. To make life easier, I use iconv() to convert all the character columns in each shapefile from Latin 1 (ISO-8859-1) to Unicode (UTF-8).

Make sure you download and install the Overpass font from Google Fonts if you want to use the custom fonts throughout the post.

coastline <- read_sf("data/ME-GIS/Coastline2.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

contours <- read_sf("data/ME-GIS/Contours_18.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

rivers <- read_sf("data/ME-GIS/Rivers.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

roads <- read_sf("data/ME-GIS/Roads.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

lakes <- read_sf("data/ME-GIS/Lakes.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

regions <- read_sf("data/ME-GIS/Regions_Anno.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

forests <- read_sf("data/ME-GIS/Forests.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

mountains <- read_sf("data/ME-GIS/Mountains_Anno.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

placenames <- read_sf("data/ME-GIS/Combined_Placenames.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

We’ll also load and process the Census data and Natural Earth data (we did this before in the overview, but we’ll do it again here in case you skipped that part):
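A condensed sketch of that loading, repeating the same steps as in the overview with the WGS 84 transformation applied to the US map:

```r
library(rnaturalearth)

# US states from the downloaded Census shapefile
lower_48 <- read_sf("data/cb_2022_us_state_20m/cb_2022_us_state_20m.shp") %>%
  filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))) %>%
  st_transform(st_crs("EPSG:4326"))  # WGS 84

# World map from {rnaturalearth}
world_map <- ne_countries(scale = 50, returnclass = "sf") %>%
  filter(iso_a3 != "ATA")  # Remove Antarctica
```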

Finally, we’ll define a couple little helper functions to convert between meters and miles (the Middle Earth data is stored as meters), and define some colors that we’ll use in the maps.

Code

miles_to_meters <- function(x) {
  x * 1609.344
}

meters_to_miles <- function(x) {
  x / 1609.344
}

clr_green <- "#035711"
clr_blue <- "#0776e0"
clr_yellow <- "#fffce3"

# Format numeric coordinates with degree symbols and cardinal directions
format_coords <- function(coords) {
  ns <- ifelse(coords[[1]][2] > 0, "N", "S")
  ew <- ifelse(coords[[1]][1] > 0, "E", "W")

  glue(
    "{latitude}°{ns} {longitude}°{ew}",
    latitude = sprintf("%.6f", coords[[1]][2]),
    longitude = sprintf("%.6f", coords[[1]][1])
  )
}

Exploring the different layers

With all these shapefiles loaded, we can experiment with them and see what’s in them. Here’s the coastline:

Code

ggplot() +
  geom_sf(data = coastline, linewidth = 0.25, color = "grey50")

Neat. We can add some rivers and lakes to it:

Code

ggplot() +
  geom_sf(data = coastline, linewidth = 0.25, color = "grey50") +
  geom_sf(data = rivers, linewidth = 0.2, color = clr_blue, alpha = 0.5) +
  geom_sf(data = lakes, linewidth = 0.2, color = clr_blue, fill = clr_blue)

The level of detail in these coastlines and borders is incredible. Great work, ME-DEM team!

Let’s add placenames:

Code

ggplot() +
  geom_sf(data = coastline, linewidth = 0.25, color = "grey50") +
  geom_sf(data = rivers, linewidth = 0.2, color = clr_blue, alpha = 0.5) +
  geom_sf(data = lakes, linewidth = 0.2, color = clr_blue, fill = clr_blue) +
  geom_sf(data = placenames, size = 0.5)

Ha, that’s less than helpful. There are 785 different placenames in the data. Since the placenames object is just a fancy dataframe, we can filter it and just look at a few of the places: Hobbiton (the Shire), Rivendell, Edoras (capital of Rohan), and Minas Tirith (capital of Gondor). We’ll also add labels with geom_sf_label() and scoot the labels up a bit so that they’re not on top of the points. The geographic data here is measured in meters, so we can specify how many meters we want each label moved up. Because I don’t think in the metric system, and because there are 1,609 meters in a mile and that implies big numbers, I’ll specify the label offset in miles with the miles_to_meters() function we made earlier. We’ll push the labels up by 50 miles (or 80,467 meters):

Code

places <- placenames %>%
  filter(NAME %in% c("Hobbiton", "Rivendell", "Edoras", "Minas Tirith"))

ggplot() +
  geom_sf(data = coastline, linewidth = 0.25, color = "grey50") +
  geom_sf(data = rivers, linewidth = 0.2, color = clr_blue, alpha = 0.5) +
  geom_sf(data = lakes, linewidth = 0.2, color = clr_blue, fill = clr_blue) +
  geom_sf(data = places, size = 1) +
  geom_sf_label(data = places, aes(label = NAME), nudge_y = miles_to_meters(50))

Fancy map of Middle Earth with lots of layers

So far we’ve seen that we can stack up as many geom_sf() layers as we want to combine each of these shapefiles into a single plot, and we can modify each of the layers just like a standard ggplot geom. Here’s a more polished fancy final version of Middle Earth with better colors, fonts, elevation contours, and so on.

Code

places <- placenames %>%
  filter(NAME %in% c("Hobbiton", "Rivendell", "Edoras", "Minas Tirith"))

ggplot() +
  geom_sf(data = contours, linewidth = 0.15, color = "grey90") +
  geom_sf(data = coastline, linewidth = 0.25, color = "grey50") +
  geom_sf(data = rivers, linewidth = 0.2, color = clr_blue, alpha = 0.5) +
  geom_sf(data = lakes, linewidth = 0.2, color = clr_blue, fill = clr_blue) +
  geom_sf(data = forests, linewidth = 0, fill = clr_green, alpha = 0.5) +
  geom_sf(data = mountains, linewidth = 0.25, linetype = "dashed") +
  geom_sf(data = places) +
  geom_sf_label(
    data = places, aes(label = NAME),
    nudge_y = miles_to_meters(40),
    family = "Overpass ExtraBold", fontface = "plain"
  ) +
  theme_void() +
  theme(plot.background = element_rect(fill = clr_yellow))

Map of just the Shire

That’s a really neat map! We can get even fancier though! Given the quality of the geographic data, we can zoom in and get much more detail for specific regions. For instance, we can zoom in on just the Shire. We can specify a window of coordinates in coord_sf() to zoom the plot, and with actual real world data we could use a tool like this to find those coordinates on a map, but since this is fictional data, it’s a little bit trickier to define specific bounds.

To find the rough bounds of the Shire, we’ll first extract the coordinates for Hobbiton, home of Bilbo and Frodo Baggins. The geographic data is currently stuck in the geometry list column, but we can use map_dbl() from {purrr} to extract the values as numbers:

Alternatively, we can avoid {purrr} stuff and pull out the numbers directly with st_geometry(). I prefer keeping everything inside the dataframe, though, so I typically use {purrr} for this kind of thing.
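Both approaches might look something like this sketch (assuming placenames is loaded as above; the hobbiton_coords name is mine):

```r
library(purrr)

hobbiton <- placenames %>%
  filter(NAME == "Hobbiton")

# {purrr} approach: keep the extracted values inside the dataframe
hobbiton_coords <- hobbiton %>%
  mutate(
    x = map_dbl(geometry, ~ .x[1]),
    y = map_dbl(geometry, ~ .x[2])
  )

# Direct approach: pull the raw numbers straight out of the geometry column
st_geometry(hobbiton)[[1]]
```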

Those (515948, 1043820) coordinates are the location of Hobbiton, measured in meters. We can add and subtract some amount of meters on each side of the coordinates to build a window around Hobbiton and set the bounds of the map. Here I add 30 miles to the west, 60 miles to the east, 35 miles to the north, and 20 miles to the south of Hobbiton (I figured those out through a bunch of trial and error to get the main features and labels that I wanted to show in the fancier map below).
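That window-building step probably looks something like this sketch (shire_window is a name I’m introducing; the offsets use the miles_to_meters() helper from earlier):

```r
hobbiton_x <- 515948
hobbiton_y <- 1043820

# Bounds of the zoomed-in Shire map, offset from Hobbiton in miles
shire_window <- list(
  xlim = c(hobbiton_x - miles_to_meters(30),   # 30 miles west
           hobbiton_x + miles_to_meters(60)),  # 60 miles east
  ylim = c(hobbiton_y - miles_to_meters(20),   # 20 miles south
           hobbiton_y + miles_to_meters(35))   # 35 miles north
)
```

These limits then get fed to coord_sf(), e.g. coord_sf(xlim = shire_window$xlim, ylim = shire_window$ylim).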

Using that window of coordinates, we can make the map extra fancy with some more enhancements. The roads data, for instance, includes a column that indicates if a road is primary, secondary, or tertiary, so we can size by road importance. We can also add some neat little annotations, like a compass indicator and a scale marker (using annotation_scale() and annotation_north_arrow() from the {ggspatial} package). We’ll also add a Tolkienesque plot title with the Aniron font:

We’re not done yet though! We can do a lot more GIS-related work with R. Let’s calculate some distances for fun!

In the first half of The Fellowship of the Ring, Frodo, Sam, Merry, and Pippin travel from the Shire to Rivendell. How long of a journey was that?

To figure this out we can extract the coordinates for Rivendell and then find the difference between it and Hobbiton. This doesn’t follow any roads or anything—it’s just as the Nazgûl flies—but it should be fairly accurate.
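A sketch of that calculation with st_distance(), assuming hobbiton and rivendell are single-row subsets of placenames:

```r
rivendell <- placenames %>%
  filter(NAME == "Rivendell")

# Straight-line distance in the map's native meters, converted to miles
st_distance(hobbiton, rivendell) %>%
  meters_to_miles()
# ~229 miles
```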

According to Karen Wynn Fonstad’s Atlas of Middle-earth, though, this should be 458 miles, which is exactly double the amount we just found! Somehow the distances between everything in the shapefiles are halved from regular-Earth miles.

To fix this we can double the distance between Hobbiton and everything else in the dataset, expanding the data outward from Hobbiton, which now acts like the center of the world:

Code

me_scaled <- places %>%
  filter(NAME %in% c("Hobbiton", "Rivendell", "Edoras", "Minas Tirith")) %>%
  # Subtract the Hobbiton coordinates, double the result,
  # then add the Hobbiton coordinates back
  st_set_geometry((st_geometry(.) - st_geometry(hobbiton)) * 2 + st_geometry(hobbiton))

# Extract new coordinates from scaled-up version
hobbiton_scaled <- me_scaled %>% filter(NAME == "Hobbiton")
rivendell_scaled <- me_scaled %>% filter(NAME == "Rivendell")

# Fixed!
st_distance(hobbiton_scaled, rivendell_scaled) %>% meters_to_miles()
##      [,1]
## [1,]  458

That’s correct now—it was 458 miles from the Shire to Rivendell.

Sticking Middle Earth in Real Earth

Now that we can work with correct distances, we can stick Middle Earth inside Real Earth to help visualize how spread out Tolkien’s world is.

In the United States

First, let’s stick a scaled-up version of Middle Earth in the United States. For fun, we’ll put the Shire in the geographic center of the US, and we’ll calculate the coordinates for that with R just to show that it’s possible.

Currently we have a dataset with 49 rows (48 states + DC). We can use the st_centroid() function to find the center of geographic areas, but if we use it on our current data, we’ll get 49 separate centers. So instead, we’ll melt all the states into one big geographic shape with group_by() and summarize() (using summarize() on the geometry column in an sf dataset combines the geographic areas), and then use st_centroid() on that:

Code

# Melt the lower 48 states into one big shape first, then use st_centroid()
us_dissolved <- lower_48 %>%
  mutate(country = "US") %>%  # Create new column with the country name
  group_by(country) %>%       # Group by that country name column
  summarize()                 # Collapse all the geographic data into one big blob
us_dissolved
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -125 ymin: 24.5 xmax: -66.9 ymax: 49.4
## Geodetic CRS:  WGS 84
## # A tibble: 1 × 2
##   country geometry
##   <chr>   <MULTIPOLYGON [°]>
## 1 US      (((-68.9 43.8, -68.9 43.8, -68.8 43.8, -68.9 43.9, -68.9 43.9, -68.9 43.8)), ((-71.6...

us_center <- us_dissolved %>%
  st_geometry() %>%  # Extract the geometry column
  st_centroid()      # Find the center
us_center
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -99 ymin: 39.8 xmax: -99 ymax: 39.8
## Geodetic CRS:  WGS 84
## POINT (-99 39.8)

According to these calculations, the center of the contiguous US is 39.751441°N, 98.965620°W. Technically that’s not 100% correct—the true location is 39.833333°N, 98.583333°W, but this is close enough (according to Google, it’s 25 miles off). I’m guessing the discrepancy is due to differences in the shapefile—I’m not using the highest resolution possible, and there might be islands I need to account for (or not account for). Who knows.

Here’s where that is. I’m using the {leaflet} package just for fun here (this post is a showcase of different R-based GIS things, so let’s showcase!):

Code

us_center_plot <- us_dissolved %>%
  st_centroid() %>%
  mutate(fancy_coords = format_coords(geometry)) %>%
  mutate(label = glue(
    "<span style='display: block; text-align: center;'><strong>Roughly the center of the contiguous US</strong>",
    "<br>{fancy_coords}</span>"
  ))

leaflet(us_center_plot) %>%
  setView(
    lng = st_geometry(us_center_plot)[[1]][1],
    lat = st_geometry(us_center_plot)[[1]][2],
    zoom = 4
  ) %>%
  addTiles() %>%
  addCircleMarkers(
    label = ~htmltools::HTML(label),
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = FALSE)
  )

Next, we need to transform the Middle Earth data so that it fits on the US map. We need to do a few things to make this work:

Double all the distances so they match Real World miles

Change the projection of each of the Middle Earth-related datasets to match the projection of lower_48, or WGS 84, or EPSG:4326

Shift the Middle Earth-related datasets so that Hobbiton aligns with the center of the US.

Changing the projection of an {sf}-enabled dataset is super easy with st_transform(). Let’s first transform the CRS for the Hobbiton coordinates:

Code

hobbiton_in_us <- hobbiton %>%
  st_transform(st_crs(lower_48))

hobbiton_in_us %>% st_geometry()
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 3.15 ymin: 9.44 xmax: 3.15 ymax: 9.44
## Geodetic CRS:  WGS 84
## POINT (3.15 9.44)

Note how the coordinates are now on the decimal degrees scale (3.15, 9.44) instead of the meter scale (515948, 1043820). That’s how the US map is set up, so now we can do GIS math with the two maps.

Next, we need to calculate the offset from the center of the US and Hobbiton by finding the difference between the two sets of coordinates:
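Since both points now share the same CRS, the offset is plain geometry subtraction. A sketch (me_to_us is the name used in the next chunk):

```r
# Vector from Hobbiton (in the US projection) to the center of the US
me_to_us <- st_geometry(us_center) - st_geometry(hobbiton_in_us)
```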

Now we can use that offset to redefine the geometry column in any Middle Earth-related {sf}-enabled dataset we have. Here’s the process for the places data—it’ll be the same for any of the other shapefiles.

Code

me_places_in_us <- places %>%
  # Make the Middle Earth data match the US projection
  st_transform(st_crs(lower_48)) %>%
  # Just look at a handful of places
  filter(NAME %in% c("Hobbiton", "Rivendell", "Edoras", "Minas Tirith")) %>%
  # Double the distances
  st_set_geometry((st_geometry(.) - st_geometry(hobbiton_in_us)) * 2 + st_geometry(hobbiton_in_us)) %>%
  # Shift everything around so that Hobbiton is in the center of the US
  st_set_geometry(st_geometry(.) + me_to_us) %>%
  # All the geometry math made us lose the projection metadata; set it again
  st_set_crs(st_crs(lower_48))

We can now stick this US-transformed set of place locations inside a map of the US. (Note the ±70000 values for nudging. I have no idea what scale these are on—they’re not meters or miles (maybe feet? maybe decimal degrees?). I had to tinker with different values until it looked okay.)
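The plot itself is just a couple more geom_sf() layers. A sketch (the fill and 70000 nudge value are the trial-and-error choices mentioned above, not definitive):

```r
ggplot() +
  geom_sf(data = lower_48, fill = "grey95", color = "white", linewidth = 0.25) +
  geom_sf(data = me_places_in_us) +
  geom_sf_label(
    data = me_places_in_us, aes(label = NAME),
    nudge_y = 70000  # mystery units; tinker until it looks okay
  ) +
  theme_void()
```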

Assuming the Shire is in the middle of Kansas, Rivendell would be near the Mississippi River in Missouri. Rohan is down in southern Arkansas, while Gondor is in southern Alabama.

We could be even fancier and reshift all the Middle Earth shapefiles to fit in the US, and then overlay all of Middle Earth on the US, but I won’t do that here. I’ll just stick the coastline on so we can compare the relative sizes of the US and Middle Earth:

Sticking Middle Earth in the US makes sense because I live in the US, so these relative distances are straightforward to me. (I’m in Georgia, which is the middle of Mordor in the maps above).

But Tolkien was from England and lived in Oxford—at 20 Northmoor Road to be precise, or 51.771004°N, 1.260142°W to be even more precise (I found this by going to Google Maps, right-clicking on Tolkien’s home, and copying the coordinates). Here’s where that is:

We can put Hobbiton in Tolkien’s home and then see the relative distances to the rest of Middle Earth from Oxford.

We’ll use the Natural Earth data that we loaded at the beginning of this post. We could theoretically filter it to only look at European countries, since it includes a column for continent, but doing so causes all sorts of issues:

Russia is huuuuge

French Guiana is officially part of France, so the map includes a part of South America

Other countries like Denmark, Norway, and the UK have similar overseas province-like territories, so the map gets even more expanded

We could do some fancy filtering and use more detailed data that splits places like France into separate subdivisions (i.e. one row for continental Europe France, one row for French Guiana, etc.), but that’s a lot of work. So instead, we’ll use coord_sf() to define a window so we can zoom in on just a chunk of Europe. Before, we added some arbitrary number of miles around the coordinates for Hobbiton. This time we’ll use a helpful tool from OpenStreetMap that lets you draw a bounding box on a world map to get coordinates to work with:

We can then create a little matrix of coordinates. We’re ultimately going to use the PTRA08 / LAEA Europe projection (EPSG:5633), which is centered on Portugal and works well for Europe-centric maps, so we’ll convert the list of coordinates to that projection.

Code

europe_window <- st_sfc(
  st_point(c(-12.4, 29.31)),  # left (west), bottom (south)
  st_point(c(44.74, 64.62)),  # right (east), top (north)
  crs = st_crs("EPSG:4326")   # WGS 84
) %>%
  st_transform(crs = st_crs("EPSG:5633")) %>%  # LAEA Europe, centered in Portugal
  st_coordinates()

europe_window
##            X       Y
## [1,] 2135398 1019399
## [2,] 5912220 5020959

Now we can plot the full world map data and use coord_sf() to limit it to just this window:
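A sketch of that zoomed map, using the transformed window coordinates in europe_window:

```r
ggplot() +
  geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) +
  coord_sf(
    crs = st_crs("EPSG:5633"),    # LAEA Europe
    xlim = europe_window[, "X"],  # western and eastern bounds
    ylim = europe_window[, "Y"],  # southern and northern bounds
    expand = FALSE
  ) +
  theme_void()
```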

Neat. Now that we know how to zoom in on Europe, we can go through the same process we did with the US—we’ll convert the Middle Earth shapefiles to the European projection, center Hobbiton on Tolkien’s home in Oxford, double all the distances, and shift everything around.

With Hobbiton in Oxford, Rivendell is in north central Germany (near Hanover?), with Rohan in Switzerland and Gondor on the border of Croatia and Bosnia.

Things I want to do someday but am not smart enough to do

And there’s our quick tour of {sf} and Middle Earth! It’s incredible how much GIS-related stuff you can do with R, and the plots are all beautiful thanks to the magic of ggplot!

Paths

It would be really cool to be able to plot the pathways different characters took in each of the books (Bilbo and Thorin’s company; Frodo and Sam; Aragorn, Legolas, and Gimli, etc.). This data exists! The LOTR Project has detailed maps with the pathways of all of the main characters’ journeys. However, it’s not (as far as I can tell) open source or Creative Commons-licensed, and I don’t think the coordinates are directly comparable to the shapefiles from the ME-GIS project. Alas.

However, the Arda Maps shapefiles aren’t as detailed as the ME-GIS project’s, and they’re on a completely different scale. For example, here’s the island of Númenor (featured in Amazon’s The Rings of Power). I downloaded the shapefiles from their GitHub repository—the Second Age shapefiles are buried in QGIS/second age/arda2.

Here I use st_bbox() to create a bounding box of coordinates that I then use to crop the underlying data. This is different from what we did with Europe, where we plotted the whole world map and then zoomed in on just a chunk of western Europe. Here, st_crop() cuts out the geographic data that doesn’t fall within the box (similar to filtering).

Code

numenor_box <- st_bbox(c(xmin = 0.007, xmax = 0.017, ymin = -0.025, ymax = -0.015))

numenor_outlines <- read_sf("data/Arda-Maps/QGIS/second age/arda2/poly_outline.shp") %>%
  filter(name == "Numenor")

numenor_rivers <- read_sf("data/Arda-Maps/QGIS/second age/arda2/line_river.shp") %>%
  st_crop(numenor_box)

numenor_cities <- read_sf("data/Arda-Maps/QGIS/second age/arda2/point_city.shp") %>%
  st_crop(numenor_box)

ggplot() +
  geom_sf(data = numenor_outlines, fill = "#F2CB9B") +
  geom_sf(data = numenor_rivers, linewidth = 0.4, color = clr_blue) +
  geom_sf(data = numenor_cities) +
  # Use geom_label_repel with the geometry column!
  ggrepel::geom_label_repel(
    data = numenor_cities,
    aes(label = eventname, geometry = geometry),
    stat = "sf_coordinates",
    seed = 1234,
    family = "Overpass ExtraBold"
  ) +
  annotation_scale(
    location = "tl", bar_cols = c("grey30", "white"),
    text_family = "Overpass", unit_category = "imperial"
  ) +
  annotation_north_arrow(
    location = "tl", pad_y = unit(1.5, "lines"),
    style = north_arrow_fancy_orienteering(
      fill = c("grey30", "white"), line_col = "grey30", text_family = "Overpass"
    )
  ) +
  labs(title = "Númenor") +
  theme_void() +
  theme(
    plot.background = element_rect(fill = clr_yellow),
    plot.title = element_text(family = "Aniron", size = rel(2), hjust = 0.02)
  )

The map looks fantastic! But notice the scale bar in the top left corner—in this data, Númenor is only a couple thousand feet wide—less than half a mile. The distances are all way wrong. I could probably scale it up by comparing the projection distances in the Arda Maps’ version of regular Middle Earth with the ME-GIS project’s version of regular Middle Earth and then do some fancy math, but that goes beyond my skills.

Update! Second Age maps scaled to Real World distances!

Just kidding! Scaling stuff up doesn’t go beyond my skills. We’ll do it.

We already converted the data from ME-GIS into Real World miles by doubling all the coordinates centered on Hobbiton, which became the de facto center of the world. We’ll do it again here since all those calculations happened way earlier in this post:

Code

# Load the shapefile
placenames <- read_sf("data/ME-GIS/Combined_Placenames.shp") %>%
  mutate(across(where(is.character), ~iconv(., from = "ISO-8859-1", to = "UTF-8")))

# Pull out Hobbiton
hobbiton <- placenames %>%
  filter(NAME == "Hobbiton")

# Double all the distances
places_ta_scaled <- placenames %>%
  # Subtract the Hobbiton coordinates, double the result,
  # then add the Hobbiton coordinates back
  st_set_geometry((st_geometry(.) - st_geometry(hobbiton)) * 2 + st_geometry(hobbiton)) %>%
  st_set_crs(st_crs(placenames))

# Confirm that it's 458 miles between Hobbiton and Rivendell
hobbiton_ta <- places_ta_scaled %>% filter(NAME == "Hobbiton")
rivendell_ta <- places_ta_scaled %>% filter(NAME == "Rivendell")

# Use set_units() just for fun since st_distance returns the units as metadata
st_distance(hobbiton_ta, rivendell_ta) %>%
  units::set_units("miles")
## Units: [miles]
##      [,1]
## [1,]  458

The Second Age data doesn’t have Hobbiton in it since Hobbits didn’t exist yet (you can see the Harfoots, Amazon’s version of proto-Hobbits, in The Rings of Power), but it does have Bree, which is a village near the Shire.

So first, we’ll find the distance between Bree and Rivendell:

Next, let’s see how far apart Bree and Rivendell are in the Arda-Maps Second Age map. We’ll reload the data and convert the projection to use the same CRS as the ME-GIS map so that things are comparable.

Hahaha, in this tiny map, it’s only a sixth of a mile between Bree and Rivendell. Assuming there are 2,000 steps in a mile, that’s only 333 steps, which is just a little more than what Fitbits and Apple Watches try to make you do over the course of an hour.

We need to turn this sixth of a mile into 360 miles, which involves dividing by… something. I always forget how to rescale things, so I find it helpful to write out the algebra for it:
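Written out with the two known distances (a sixth of a mile in the Arda-Maps data; 360 miles in the rescaled ME-GIS data), the scale factor falls out directly:

$$
\begin{aligned}
\tfrac{1}{6} \text{ mile} \times \text{scale factor} &= 360 \text{ miles} \\
\text{scale factor} &= \frac{360}{1/6} = 360 \times 6 = 2160
\end{aligned}
$$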

If we multiply everything in the Second Age map data by 2160ish, we should be good. First we’ll get the official, more precise number (since we’re missing decimals in the quick algebra above):
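A sketch of that calculation, assuming the Bree and Rivendell points have already been pulled out of both datasets (as `bree_ta`/`rivendell_ta` and `bree_sa`/`rivendell_sa`). Dividing the two distances gives the exact ratio; it comes back as a units object, which is why the scaling function later wraps it in `as.numeric()`:

```r
# Ratio of the Real World distance to the tiny Second Age distance
sa_to_ta_conversion <- st_distance(bree_ta, rivendell_ta) /
  st_distance(bree_sa, rivendell_sa)
sa_to_ta_conversion
```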

Now we can plot this thing. Since we’re working with a different projection, the bounding box (numenor_box) that we previously made for cropping the shapefiles won’t work. But we can be even more precise by extracting the bounding box from the Númenor outlines and then using that as the cropping box.

Also, we’ll add some more layers to the map for fun, but because this rescaling business can get repetitive and tedious, we’ll make a little function to cut down on repetition.

Code

numenor_outlines <- read_sf("data/Arda-Maps/QGIS/second age/arda2/poly_outline.shp") %>%
  st_transform(st_crs(placenames)) %>%
  filter(name == "Numenor")

# Extract the bounds for Númenor so we can crop everything else with it
numenor_bbox <- st_bbox(numenor_outlines)

# Little helper function to scale things from the Second Age to the real world
scale_sa_to_real_world <- function(x) {
  x %>%
    st_set_geometry(
      (st_geometry(.) - st_geometry(bree_sa)) * as.numeric(sa_to_ta_conversion) +
        st_geometry(bree_sa)
    ) %>%
    st_set_crs(st_crs(placenames))
}

numenor_outlines_scaled <- numenor_outlines %>%
  scale_sa_to_real_world()

numenor_rivers_scaled <- read_sf("data/Arda-Maps/QGIS/second age/arda2/line_river.shp") %>%
  st_transform(st_crs(placenames)) %>%
  st_crop(numenor_bbox) %>%
  scale_sa_to_real_world()

numenor_cities_scaled <- read_sf("data/Arda-Maps/QGIS/second age/arda2/point_city.shp") %>%
  st_transform(st_crs(placenames)) %>%
  st_crop(numenor_bbox) %>%
  scale_sa_to_real_world()

numenor_forests_scaled <- read_sf("data/Arda-Maps/QGIS/second age/arda2/poly_forest.shp") %>%
  st_transform(st_crs(placenames)) %>%
  st_crop(numenor_bbox) %>%
  scale_sa_to_real_world()

numenor_highlands_scaled <- read_sf("data/Arda-Maps/QGIS/second age/arda2/poly_highland.shp") %>%
  st_transform(st_crs(placenames)) %>%
  st_crop(numenor_bbox) %>%
  scale_sa_to_real_world()

numenor_regions_scaled <- read_sf("data/Arda-Maps/QGIS/second age/arda2/poly_region.shp") %>%
  st_transform(st_crs(placenames)) %>%
  st_crop(numenor_bbox) %>%
  scale_sa_to_real_world()

Mapping time!

Code

ggplot() +
  # Background of the island
  geom_sf(data = numenor_outlines_scaled, linewidth = 0, fill = "#F2CB9B")