Andrew Heiss
https://www.andrewheiss.com/atom.html
Andrew Heiss is an assistant professor at Georgia State University, researching international NGOs and teaching data science & economics.

How old was Aragorn in regular human years?

Tue, 21 Mar 2023 04:00:00 GMT
https://www.andrewheiss.com/blog/2023/03/21/aragorn-dunedan-numenorean-simulation/index.html
In The Two Towers, while talking with Eowyn, Aragorn casually mentions that he’s actually 87 years old.

When Aragorn is off running for miles and miles and fighting orcs and trolls and Uruk-hai and doing all his other Lord of the Rings adventures, he hardly behaves like a regular human 87-year-old. How old is he really?

It turns out that Tolkien left us a clue in some of his unfinished writings about Númenor, and we can use that information to make some educated guesses about Aragorn’s actual human-scale age. In this post I’ll (1) look at Tolkien’s Númenórean years → human years conversion system and (2) extrapolate that system through two types of statistical simulation—(a) just drawing random numbers and (b) Bayesian modeling—to make some predictions about a range of possible ages.

But first, some context about why Aragorn is so old!

Super quick crash course in the Ages of Arda + Númenor

The Lord of the Rings occurs at the end of the Third Age of the world. In Arda (the whole world Tolkien created) there are four recorded ages, plus some pre-game stuff:

The First Age: After the creation, the Elves all lived in a place outside of the main world called Valinor until the main bad guy, a fallen Ainu named Morgoth (aka Melkor), destroyed key parts of it and divided the Elves so that a bunch fled to the far west of Middle-earth, to a land called Beleriand. This starts the First Age, which is mostly about the Elves of Beleriand and their battles with Morgoth/Melkor. Sauron (the main bad guy of The Lord of the Rings) serves as Morgoth’s lieutenant and they destroy a ton of cities (with armies of balrogs and orcs) and kill a ton of elves. The elves eventually win, but all of Beleriand sinks into the ocean and the elves either go back to Valinor (technically just outside of Valinor, which serves as elvish heaven), or to eastern Middle-earth. This is all covered in the bulk of The Silmarillion.

The Second Age: While the First Age is mostly about elves, mortal men eventually show up in Beleriand and they play a key role in the battle against Morgoth. They also intermingle with the immortal elves, sometimes falling in love—including the famous Lúthien and Beren (who are stand-ins for Tolkien and his wife Edith). Beren and Lúthien had kids, and their kids had kids, and so on until two half-elf brothers were born at the end of the First Age: Elrond (the same Elrond from The Hobbit and The Lord of the Rings and The Rings of Power) and Elros.

As Beleriand sinks, Elrond and Elros escape to eastern Middle-earth and are then given a choice of how to proceed with their futures. Elrond decides to become immortal like an elf (and eventually builds Rivendell); Elros decides to become mortal like a man. The gods reward Elros and the men who helped the elves against Morgoth by creating a utopian island in the middle of the ocean between Valinor and Middle-earth named Númenor. The gods also granted them super long life (400+ years, as we’ll explore below), but imposed a strict ban on them—the Númenóreans were forbidden from ever sailing west toward Valinor. Thus begin the two parallel stories of the Second Age—(1) Elrond and other refugees from Beleriand doing stuff in Middle-earth, and (2) Elros and his descendants doing stuff on the island of Númenor. This is all covered in a final short part of The Silmarillion (the Akallabêth), in random appendices in The Lord of the Rings, in the newer The Fall of Númenor, and in Amazon’s TV series The Rings of Power.

Númenor and Middle-earth hum along happily for three thousand years until Sauron (who fled to Middle-earth after Morgoth was destroyed) shows up. He disguises himself as a super affable friendly dude who everyone loves and then sows chaos. He visits the elves and convinces them to make a bunch of rings, and then he heads to Númenor to convince them to violate the ban and sail to Valinor. The Númenóreans do, they get in trouble, the gods sink their island, and Númenórean refugees flee to Middle-earth, led by Elendil and his sons Isildur and Anárion, who set up a Númenórean kingdom in Gondor. Sauron starts using his fancy new One Ring and tries to conquer Middle-earth; there’s a big war against him (see the first few minutes of The Fellowship of the Ring); Sauron kills Elendil; Isildur cuts off Sauron’s finger and makes him lose the ring, which destroys his physical form, thus ending the Second Age. Isildur keeps the ring but is killed soon after, and Gondor’s royal line eventually dies out.

The Third Age: The ring disappears for a few thousand years until Gollum picks it up, then Bilbo gets it in The Hobbit, and then Frodo gets it in The Fellowship of the Ring and destroys it in The Return of the King.

Meanwhile, Gondor is ruled by a series of stewards who are supposed to take care of the kingdom until a Númenórean king returns to take his place. The magic long life of the Númenóreans keeps declining, except among the Dúnedain (singular Dúnadan), descendants of the Númenórean refugees—and Aragorn is one of them. These Dúnedain maintain some of the magic Númenórean longevity—they don’t live 400+ years like their Númenórean ancestors, but they live well beyond 150. After the ring is destroyed, Aragorn is installed as king of Gondor, the remaining elves go back to Valinor (except Arwen, Aragorn’s now-wife), and the Third Age ends.

The Fourth Age: Aragorn reigns until he’s 210, then he dies, someone else takes over, and Middle-earth lives happily ever after.

Packages

Before diving into the data and simulations, we need to load some R libraries and make some helper functions. For the sake of narrative here, the code is automatically collapsed—click on the little triangle arrow to show it.
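That collapsed setup code doesn’t survive in this text-only version. Judging from the functions used throughout the post, it likely includes something along these lines (the exact package list is my guess):

```r
library(tidyverse)  # ggplot2, dplyr, tidyr, purrr, forcats, etc.
library(glue)       # String interpolation for plot labels
library(brms)       # Bayesian models via Stan
library(tidybayes)  # add_epred_draws()
library(ggdist)     # stat_halfeye()
library(ggtext)     # geom = "richtext" annotations
library(ggimage)    # geom_image() for the Aragorn pictures
```

The post also defines helper functions (like `find_slope()`) and a custom `theme_numenor()` ggplot theme in that hidden chunk.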

Tolkien’s Númenórean years → normal human years system

In the appendix of the newly published The Fall of Númenor and Other Tales From the Second Age of Middle-earth is a fascinating footnote that explains exactly how to convert Second Age Númenórean years into normal human years. Tolkien writes:

Deduct 20: Since at 20 years a Númenórean would be at about the same stage of development as an ordinary person.

Add to this 20 the remainder divided by 5. Thus a Númenórean man or woman of years [X] would be approximately of the “age” [Y] (Tolkien 2022, “The Life of the Númenóreans,” note 8, p. 262)

And he provides this helpful table, which we’ll stick in an R data frame so we can play with it:
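The `ages` data frame itself is built in the collapsed setup code; reconstructed from Tolkien’s table, it’s just two evenly spaced sequences:

```r
library(tibble)

# Tolkien's table: Númenórean ages 25-425 map to human ages 21-101
ages <- tibble(
  numenor_age = seq(25, 425, by = 25),
  normal_human_age = seq(21, 101, by = 5)
)
```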

ages |>
  rename("Númenórean age" = numenor_age, "Normal human age" = normal_human_age) |>
  t() |>
  knitr::kable()

Table 1: Tolkien’s original table for converting between Númenórean and human ages

Númenórean age → Normal human age
25 → 21
50 → 26
75 → 31
100 → 36
125 → 41
150 → 46
175 → 51
200 → 56
225 → 61
250 → 66
275 → 71
300 → 76
325 → 81
350 → 86
375 → 91
400 → 96
425 → 101

Tolkien’s logic is a little convoluted, but it works. For example, if a Númenórean is 125, subtract 20 to get 105, divide that remainder by 5 to get 21, and add that to 20 to get 41 normal human years.
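Written as a little R function (my own sketch—the post’s helper code is collapsed):

```r
# Tolkien's rule: deduct 20, divide the remainder by 5, add the result to 20
numenor_to_human <- function(numenor_age) {
  20 + (numenor_age - 20) / 5
}

numenor_to_human(125)
#> [1] 41
```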

If we plot all of Tolkien’s example ages, we get a nice linear relationship between Númenórean ages and human ages:

Code

ggplot(ages, aes(x = numenor_age, y = normal_human_age)) +
  geom_point() +
  labs(x = "Age in Númenórean years", y = "Age in normal human years") +
  theme_numenor()

Since it’s linear, we can skip the convoluted subtract-20-add-divided-remainder logic and instead figure out a slope and intercept for the line.

Code

age_model <- lm(normal_human_age ~ numenor_age, data = ages)
age_model
## 
## Call:
## lm(formula = normal_human_age ~ numenor_age, data = ages)
## 
## Coefficients:
## (Intercept)  numenor_age  
##        16.0          0.2

The line crosses the y-axis at 16 normal human years and then increases by 0.2 for every Númenórean year. The 0.2 is that divide-by-5 rule, since 1/5 = 0.2, and the 16 is what’s left of the deduct-20 step, since 20 − 20/5 = 16.

Code

ggplot(ages, aes(x = numenor_age, y = normal_human_age)) +
  geom_smooth(method = "lm", color = clrs[3]) +
  geom_point() +
  labs(x = "Age in Númenórean years", y = "Age in normal human years") +
  annotate(
    geom = "richtext", x = 300, y = 40,
    label = "Normal human years =<br>16 + (0.2 × Númenórean years)"
  ) +
  theme_numenor()

Estimating the Dúnedain → normal human age system

Aragorn was a Dúnedan, or a descendant of the Númenórean refugees Elendil and Isildur, so he inherited their unnaturally long lifespan. Aragorn was 87 at the end of the Third Age, and he lived until he was 210. If we naively assume he lived as long as a standard Númenórean, he would have gone through the events of The Lord of the Rings at 33 and died surprisingly young at 58:

Code

ggplot(ages, aes(x = numenor_age, y = normal_human_age)) +
  geom_smooth(method = "lm", color = clrs[3]) +
  geom_point() +
  geom_vline(xintercept = c(87, 210)) +
  annotate(geom = "segment", x = -Inf, xend = 87, y = 33, yend = 33,
           linewidth = 0.5, linetype = "21", color = "grey50") +
  annotate(geom = "segment", x = -Inf, xend = 210, y = 58, yend = 58,
           linewidth = 0.5, linetype = "21", color = "grey50") +
  geom_image(
    data = tibble(numenor_age = 87, normal_human_age = 80, image = "img/aragorn-alive.jpg"),
    aes(image = image), size = 0.15, asp = 1.618
  ) +
  annotate(geom = "richtext", x = 87, y = 62,
           label = glue("<span style='color:{clrs[1]};'>87 Númenórean years</span>", "<br>",
                        "<span style='color:{clrs[6]};'>33 human years</span>")) +
  geom_image(
    data = tibble(numenor_age = 210, normal_human_age = 80, image = "img/aragorn-dead.jpg"),
    aes(image = image), size = 0.15, asp = 1.618
  ) +
  annotate(geom = "richtext", x = 210, y = 98,
           label = glue("<span style='color:{clrs[1]};'>210 Númenórean years</span>", "<br>",
                        "<span style='color:{clrs[6]};'>58 human years</span>")) +
  labs(x = "Age in Númenórean years", y = "Age in normal human years") +
  theme_numenor() +
  theme(axis.text.x = element_text(color = clrs[1]),
        axis.title.x = element_text(color = clrs[1]),
        axis.text.y = element_text(color = clrs[6]),
        axis.title.y = element_text(color = clrs[6]))

But the refugee Númenóreans (the Dúnedain) gradually lost their long-living powers after Númenor was destroyed, so this Númenórean formula doesn’t really apply to Aragorn.

According to supplemental Tolkien writings, the 7th steward of Gondor was the last person to live to 150 years, and by the time of the events of The Lord of the Rings, nobody in Gondor had lived past 100 years since Belecthor II, the 21st steward of Gondor. Denethor II—the tomato-massacring father of Boromir and Faramir—was the 26th steward and took the position 112 years after Belecthor II died. So it had been a long time since anyone had lived that long. With the exception of Aragorn and the other Dúnedan Rangers of the North, all the magic Númenórean power had waned.

After the Ring was destroyed, something seems to have changed, though. Faramir—likely distantly related to the Dúnedain—lived until 120, so something new was in the air at the beginning of the post-Sauron Fourth Age. Aragorn—an actual Dúnedan and descendant of Númenor—made it to 210, but he maybe had some special elf help from Arwen.

So given that some of the old Númenórean power seems to have returned (and given that Aragorn was unnaturally long-lived), we can try to figure out the Dúnedan → regular human age conversion three different ways.

Arbitrary maximum age

Since the Númenórean → regular human age line is perfectly linear, we’ll assume that the Dúnedan → regular human age line is also perfectly linear, just scaled down. We’ll start the line at 16 again, but now we’ll pretend that 210 years in Fourth Age Dúnedan years is 100 in human years (Aragorn lived a stunningly long time).

To figure out the equation for the new Dúnedan line, we need to figure out the slope. We have two points—(0, 16) and (210, 100)—that we can use to calculate the slope: (100 − 16) / (210 − 0) = 0.4.
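The `find_slope()` helper used in the simulation code later lives in the collapsed setup code; a version consistent with how it’s called would be:

```r
# Slope of the line through two (x, y) points, each given as c(x, y)
find_slope <- function(p1, p2) {
  (p2[2] - p1[2]) / (p2[1] - p1[1])
}

find_slope(c(0, 16), c(210, 100))
#> [1] 0.4
```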

When Aragorn tells Eowyn that he’s 87, that’s actually the equivalent of 51ish. This fits with Tolkien’s writings, since in The Fellowship of the Ring, Aragorn was nearing the prime of life (Tolkien 2012, bk. 1, ch. 10, “Strider”).

Here’s what that looks like across the whole hypothetical Dúnedan lifespan:

Code

ggplot(ages_dunedain, aes(x = dunedain_age, y = normal_human_age)) +
  geom_smooth(method = "lm", color = clrs[7]) +
  geom_vline(xintercept = c(87, 210)) +
  annotate(geom = "segment", x = -Inf, xend = 87, y = 50.8, yend = 50.8,
           linewidth = 0.5, linetype = "21", color = "grey50") +
  annotate(geom = "segment", x = -Inf, xend = 210, y = 100, yend = 100,
           linewidth = 0.5, linetype = "21", color = "grey50") +
  geom_image(
    data = tibble(dunedain_age = 87, normal_human_age = 15, image = "img/aragorn-alive.jpg"),
    aes(image = image), size = 0.15, asp = 1.618
  ) +
  annotate(geom = "richtext", x = 87, y = 41,
           label = glue("<span style='color:{clrs[3]};'>87 Dúnedan years</span>", "<br>",
                        "<span style='color:{clrs[6]};'>50.8 human years</span>")) +
  geom_image(
    data = tibble(dunedain_age = 210, normal_human_age = 60, image = "img/aragorn-dead.jpg"),
    aes(image = image), size = 0.15, asp = 1.618
  ) +
  annotate(geom = "richtext", x = 210, y = 86,
           label = glue("<span style='color:{clrs[3]};'>210 Dúnedan years</span>", "<br>",
                        "<span style='color:{clrs[6]};'>100 human years</span>")) +
  labs(x = "Age in Dúnedan years", y = "Age in normal human years") +
  coord_cartesian(xlim = c(0, 240), ylim = c(0, 110)) +
  theme_numenor() +
  theme(axis.text.x = element_text(color = clrs[3]),
        axis.title.x = element_text(color = clrs[3]),
        axis.text.y = element_text(color = clrs[6]),
        axis.title.y = element_text(color = clrs[6]))

Simulating a bunch of slopes

Deciding that 210 Dúnedan years was 100 human years was a pretty arbitrary choice. Maybe Aragorn lived to be the equivalent of 90? Or 80? Or 120 like Faramir?

Instead of choosing one single endpoint, we can simulate the uncertainty around the final age at death. We’ll say that 210 years in Fourth Age-era Dúnedan years is the equivalent of somewhere between 80 and 120 regular human years.

Code

# Generate a bunch of maximum human ages, centered around 100, ± 20ish
lots_of_slopes <- tibble(max_human_age = rnorm(1000, 100, 10)) %>%
  # Find the slope of each of these new lines
  mutate(slope = map_dbl(max_human_age, ~ find_slope(c(0, 16), c(210, .x)))) %>%
  # Generate data for each of the new lines
  mutate(ages = map(slope, ~ {
    tibble(dunedain_age = seq(20, 210, by = 1)) |>
      mutate(normal_human_age = 16 + .x * dunedain_age)
  })) %>%
  mutate(id = 1:n()) %>%
  unnest(ages)

Each of these simulated lines is a plausible age conversion formula.

Code

lots_of_slopes %>%
  ggplot(aes(x = dunedain_age, y = normal_human_age)) +
  geom_line(aes(group = id), method = "lm", stat = "smooth", alpha = 0.05, color = clrs[5]) +
  geom_smooth(method = "lm", color = clrs[7]) +
  geom_vline(xintercept = c(87, 210)) +
  labs(x = "Age in Dúnedan years", y = "Age in normal human years") +
  theme_numenor() +
  theme(axis.text.x = element_text(color = clrs[3]),
        axis.title.x = element_text(color = clrs[3]),
        axis.text.y = element_text(color = clrs[6]),
        axis.title.y = element_text(color = clrs[6]))

At lower Dúnedan ages there’s much less uncertainty, but as Dúnedan age increases, so too does the range of possible human ages. We can look at the distribution of the predicted normal human ages at both 87 and 210 to get a sense for these ranges:

Code

lots_of_slopes |>
  filter(dunedain_age %in% c(87, 210)) |>
  mutate(dunedain_age = glue("{dunedain_age} Dúnedan years"),
         dunedain_age = fct_inorder(dunedain_age)) |>
  ggplot(aes(x = normal_human_age, fill = dunedain_age)) +
  stat_halfeye() +
  scale_fill_manual(values = c(clrs[3], clrs[4]), guide = "none") +
  labs(x = "Normal human age", y = "Density", fill = NULL) +
  facet_wrap(vars(dunedain_age), scales = "free_x") +
  theme_numenor() +
  theme(panel.grid.major.y = element_blank())

Bayesian simulation

As a final approach for guessing at Aragorn’s age, we’ll use a Bayesian model to generate a posterior distribution of plausible conversion lines and predicted ages. Technically this isn’t a true posterior—there’s no actual data or anything, so we’ll sample just from the prior distributions that we feed the model. But it’s still a helpful exercise in simulation.

We’ll define this statistical model:
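The equation itself didn’t survive in this version of the post; reconstructed from the priors set in the code that follows, the model looks like this (my notation, not necessarily the post’s):

$$
\begin{aligned}
\text{Normal human age}_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta \times \text{Dúnedan age}_i \\
\alpha &= 16 \\
\beta &\sim \mathcal{N}(0.4,\ 0.05) \\
\sigma &\sim \operatorname{Exponential}(1)
\end{aligned}
$$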

We fix the intercept at 16 as before, and we say that the slope is around 0.4 ± 0.1ish. We’ll use Stan (through brms) to fit a model based on just these priors:

Code

# Stan likes to work with mean-centered variables, so we'll center dunedain_age
# here, so that 0 represents 115
ages_dunedain_centered <- ages_dunedain |>
  mutate(dunedain_age = scale(dunedain_age, center = TRUE, scale = FALSE))

# Set some priors
priors <- c(
  prior(constant(16), class = Intercept),  # Constant 16 for the intercept
  prior(normal(0.4, 0.05), class = b, coef = "dunedain_age"),  # Slope of 0.4 ± 0.1
  prior(exponential(1), class = sigma)
)

# Run some MCMC chains just with the priors, since we don't have any actual data
age_model_bayes <- brm(
  bf(normal_human_age ~ dunedain_age),
  data = ages_dunedain_centered,
  prior = priors,
  sample_prior = "only",
  chains = 4, cores = 4,
  backend = "cmdstanr",
  seed = 1234, refresh = 0
)
## Start sampling
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.2 seconds.

age_model_bayes
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: normal_human_age ~ dunedain_age 
##    Data: ages_dunedain_centered (Number of observations: 191) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       16.00      0.00    16.00    16.00   NA       NA       NA
## dunedain_age     0.40      0.05     0.30     0.50 1.00     2309     2097
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     1.01      1.00     0.03     3.75 1.00     2059     1413
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The results from the model aren’t too surprising, given that (1) we’ve seen similar results with the other methods, and (2) the 16 intercept and 0.4 slope match the priors, since we’re only dealing with priors.

Now that we have a “posterior” (again, it’s not a true posterior since there’s no actual data), we can play with it in a few different ways. First we can look at the whole range of Dúnedan ages and see lots of plausible slopes. As expected, most are around 0.4, resulting in a final age of 100, but some lines are steeper and some are shallower. Since these are posterior distributions, we can find credible intervals too (which we can interpret much more naturally than convoluted confidence intervals):

Code

draws_prior <- tibble(dunedain_age = seq(25, 210, 1)) |>
  add_epred_draws(age_model_bayes, ndraws = 500)

draws_prior |>
  ggplot(aes(x = dunedain_age, y = .epred)) +
  geom_line(aes(group = .draw), alpha = 0.05, color = clrs[5]) +
  geom_vline(xintercept = c(87, 210)) +
  labs(x = "Age in Dúnedan years", y = "Age in normal human years") +
  theme_numenor() +
  theme(axis.text.x = element_text(color = clrs[3]),
        axis.title.x = element_text(color = clrs[3]),
        axis.text.y = element_text(color = clrs[6]),
        axis.title.y = element_text(color = clrs[6]))

We can also look at the posterior distribution of predicted human ages at just 87 and 210, along with credible intervals. There’s a 95% chance that at 87, he’s actually between 42 and 59 (with an average of 51ish), and at 210 he’s actually between 79ish and 121.

draws_aragorn_ages |>
  mutate(dunedain_age = glue("{dunedain_age} Dúnedan years"),
         dunedain_age = fct_inorder(dunedain_age)) |>
  ggplot(aes(x = .epred, fill = factor(dunedain_age))) +
  stat_halfeye() +
  scale_fill_manual(values = c(clrs[3], clrs[4]), guide = "none") +
  labs(x = "Normal human age", y = "Density", fill = NULL) +
  facet_wrap(vars(dunedain_age), scales = "free_x") +
  theme_numenor() +
  theme(panel.grid.major.y = element_blank())

Conclusion

Given all the evidence we have about Númenórean ages, and after making some reasonable assumptions about Dúnedan and human lifespans, when Aragorn tells Eowyn that he’s 87, that’s really the equivalent of 50ish, with a 95% chance that he’s somewhere between 42 and 59.

References

Tolkien, J. R. R. 2012. The Fellowship of the Ring: Being the First Part of the Lord of the Rings. The Lord of the Rings, pt. 1. Boston: William Morrow & Company.

———. 2022. The Fall of Númenor and Other Tales from the Second Age of Middle-Earth. Edited by Brian Sibley. 1st ed. New York, NY: William Morrow.

One Simple Trick™ to create inline bibliography entries with Markdown and pandoc
https://www.andrewheiss.com/blog/2023/01/09/syllabus-csl-pandoc/index.html
Pandoc-flavored Markdown makes it really easy to cite and reference things. You can write something like this (assuming you use this references.bib BibTeX file):

---
title: "Some title"
bibliography: references.bib
---

According to @Lovelace:1842, computers can calculate things. This was important during World War II [@Turing:1936].
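A references.bib consistent with the rendered output shown here would contain entries like these (reconstructed from the formatted citations, so the field details are approximate):

```bibtex
@article{Lovelace:1842,
  author  = {Lovelace, Augusta Ada},
  title   = {Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator},
  journal = {Taylor's Scientific Memoirs},
  volume  = {3},
  pages   = {666--731},
  year    = {1842}
}

@article{Turing:1936,
  author  = {Turing, Alan Mathison},
  title   = {On Computable Numbers, with an Application to the Entscheidungsproblem},
  journal = {Journal of Math},
  volume  = {58},
  number  = {345-363},
  pages   = {230--265},
  year    = {1936}
}
```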

And it’ll convert to this after running the document through pandoc:

Rendered document

Some title

According to Lovelace (1842), computers can calculate things. This was important during World War II (Turing 1936).

References

Lovelace, Augusta Ada. 1842. “Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator.” Taylor’s Scientific Memoirs 3: 666–731.

Turing, Alan Mathison. 1936. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Journal of Math 58 (345-363): 230–65.

This is all great and ideal when working with documents that have a single bibliography at the end.

The limits of default in-text citations

Some documents—like course syllabuses and readings lists—don’t have a final bibliography. Instead they have lists of things people should read. However, if you try to insert citations like normal, you’ll get the inline references and a final bibliography:

Keynes, John Maynard. 1937. “The General Theory of Employment.” The Quarterly Journal of Economics 51 (2): 209–23.

Lovelace, Augusta Ada. 1842. “Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator.” Taylor’s Scientific Memoirs 3: 666–731.

Turing, Alan Mathison. 1936. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Journal of Math 58 (345-363): 230–65.

The full citations are all in the document, but not in a very convenient location. Readers have to go to the back of the document to see what they actually need to read (especially if there’s a website or DOI URL they need to click on).

Making note-based styles appear in the text

It would be great if the full citation could be included in the lists in the document instead of at the end of the document.

The easiest way to get full citations inline is to find a CSL style that uses note-based citations, like the Chicago full note style, and edit the CSL file to tell it to be an inline style instead of a note style.

The second line of all CSL files contains a <style> XML element with a class attribute. Inline styles like APA and Chicago author date have class="in-text":
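For example, the opening of an in-text style looks something like this (a minimal sketch; real CSL files carry more attributes):

```xml
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0">
  <!-- ... -->
</style>
```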

If you download a note-based CSL style and manually change it to be in-text, the notes it would normally insert appear in the text itself instead of as footnotes.

Here I downloaded Chicago full note, edited the second line to say class="in-text", and saved it as chicago-syllabus.csl:

<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0" demote-non-dropping-particle="display-and-sort" page-range-format="chicago">
  <info>
    <title>Chicago Manual of Style 17th edition (full note, but in-text)</title>
    ...

I can then tell pandoc to use that CSL when rendering the document:
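A minimal invocation along these lines should work (the input and output file names here are placeholders, not from the original post):

```shell
pandoc syllabus.md --citeproc \
  --bibliography references.bib \
  --csl chicago-syllabus.csl \
  --output syllabus.html
```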

…and the full references are included in the document itself!

Rendered document

Some course syllabus

Course schedule

Week 1

Augusta Ada Lovelace, “Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator,” Taylor’s Scientific Memoirs 3 (1842): 666–731.

Alan Mathison Turing, “On Computable Numbers, with an Application to the Entscheidungsproblem,” Journal of Math 58, no. 345-363 (1936): 230–65.

Week 2

John Maynard Keynes, “The General Theory of Employment,” The Quarterly Journal of Economics 51, no. 2 (1937): 209–23.

References

Keynes, John Maynard. “The General Theory of Employment.” The Quarterly Journal of Economics 51, no. 2 (1937): 209–23.

Lovelace, Augusta Ada. “Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator.” Taylor’s Scientific Memoirs 3 (1842): 666–731.

Turing, Alan Mathison. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Journal of Math 58, no. 345-363 (1936): 230–65.

A few minor tweaks to perfect the output

This isn’t quite perfect, though. There are three glaring problems with this:

1. We have a bibliography at the end, since Chicago notes-bibliography requires it. This makes sense for regular documents where you have footnotes throughout the body of the text with a list of references at the end, but it’s not necessary here.

2. The in-text references all have hyperlinks to their corresponding references in the final bibliography. We don’t need those, since the linked text already is the full bibliography entry.

3. If you render this in Quarto, you get helpful popups that contain the full reference when you hover over the link. But again, the link is the full reference, so that extra hover information is redundant.

All these problems are easy to fix with some additional YAML settings that suppress the final bibliography, turn off citation links, and disable Quarto’s hovering:
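In YAML metadata, those settings look something like this (suppress-bibliography and link-citations are standard pandoc citeproc options; citations-hover is Quarto-specific):

```yaml
---
title: "Some course syllabus"
bibliography: references.bib
csl: chicago-syllabus.csl
suppress-bibliography: true  # No bibliography at the end
link-citations: false        # No hyperlinks to bibliography entries
citations-hover: false       # Quarto only: no hover popups
---
```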

Augusta Ada Lovelace, “Sketch of the Analytical Engine Invented by Charles Babbage, by LF Menabrea, Officer of the Military Engineers, with Notes Upon the Memoir by the Translator,” Taylor’s Scientific Memoirs 3 (1842): 666–731.

Alan Mathison Turing, “On Computable Numbers, with an Application to the Entscheidungsproblem,” Journal of Math 58, no. 345-363 (1936): 230–65.

Week 2

John Maynard Keynes, “The General Theory of Employment,” The Quarterly Journal of Economics 51, no. 2 (1937): 209–23.

Using other styles

This is all great and super easy if you (like me) are fond of Chicago. What if you want to use APA, though? Or MLA? Or any other style that doesn’t use footnotes?

For APA, you’re in luck! There’s an APA (curriculum vitae) CSL style that you can use, and you don’t need to edit it beforehand—it just works:

Lovelace, A. A. (1842). Sketch of the analytical engine invented by Charles Babbage, by LF Menabrea, officer of the military engineers, with notes upon the memoir by the translator. Taylor’s Scientific Memoirs, 3, 666–731.

Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Journal of Math, 58(345-363), 230–265.

Week 2

Keynes, J. M. (1937). The general theory of employment. The Quarterly Journal of Economics, 51(2), 209–223.

For any other style though, you’re (somewhat) out of luck. The simple trick of switching class="note" to class="in-text" doesn’t work if the underlying style is already in-text like APA or Chicago author-date. You’d have to do some major editing and rearranging in the CSL file to force the bibliography entries to show up as inline citations, which goes beyond my skills.

How to migrate from BibDesk to Zotero for pandoc-based writing
https://www.andrewheiss.com/blog/2023/01/08/bibdesk-to-zotero-pandoc/index.html

My longstanding workflow for writing, citing, and PDF management

When I started my first master’s degree program in 2008, I decided to stop using Word for all my academic writing and instead use plain text Markdown for everything. Markdown itself had been a thing for 4 years, and MultiMarkdown—a pandoc-like extension of Markdown that could handle BibTeX bibliographies—was brand new. I did all my writing for my courses and my thesis in Markdown and converted it all to PDF through LaTeX using MultiMarkdown. I didn’t know about pandoc yet, so I only ever converted to PDF, not HTML or Word.

I stored all my bibliographic references in a tiny little references.bib BibTeX file that I managed with BibDesk. BibDesk is a wonderful and powerful program with an active developer community and it does all sorts of neat stuff like auto-filing PDFs, importing references from DOIs, searching for references on the internet from inside the program, and just providing a nice overall front end for dealing with BibTeX files.

I kept using my MultiMarkdown + LaTeX output system throughout my second master’s degree, and my references.bib file and PDF database slowly grew. R Markdown hadn’t been invented yet and I still hadn’t discovered pandoc, so living in a mostly LaTeX-based world was fine.

When I started my PhD in 2012, something revolutionary happened: the {knitr} package was invented. The new R Markdown format let you mix R code with Markdown text and create multiple outputs (HTML, LaTeX, and docx) through pandoc. I abandoned MultiMarkdown and fully converted to pandoc (thanks also in part to Kieran Healy’s Plain Person’s Guide to Plain Text Social Science). Since 2012, I’ve written exclusively in pandoc-flavored Markdown and always make sure that I can convert everything to PDF, HTML, and Word (see the “Manuscript” entry in the navigation bar here, for instance, where you can download the preprint version of that paper in a ton of different formats). I recently converted a bunch of my output templates to Quarto too.

During all this time, I didn’t really keep up with other reference managers. I used super early Zotero as an undergrad back in 2006–2008, but it didn’t fit well with my Markdown-based workflow, so I kind of ignored it. I picked it up again briefly at the beginning of my PhD, but I couldn’t get it to play nicely with R Markdown and pandoc, so I kept using trusty old BibDesk. My references.bib file got bigger and bigger as I took more and more doctoral classes and did more research, but BibDesk handled the growing library just fine. As of today, I’ve got 1,400 items in there with nearly 1,000 PDFs, and everything still works great—mostly.

Why switch away from BibTeX and BibDesk?

BibDesk got me through my dissertation and all my research projects up until now, so why consider switching away to some other system? Over the past few years, as I’ve done more reading on my iPad and worked on more coauthored projects, I’ve run into a few pain points in my citation workflow.

Problem 1: Cross-device reading

I enjoy reading PDFs on my iPad (particularly in the iAnnotate app), but getting PDFs from BibDesk onto the iPad has always required a bizarre dance:

1. Store references.bib and the BibDesk-managed folder of PDFs in Dropbox

2. Use the References iPad app to open the BibTeX file from Dropbox on the iPad

3. Use iAnnotate to navigate Dropbox and find the PDF I want to read

4. Read and annotate the PDF in iAnnotate

5. Send the finished PDF from iAnnotate back to Dropbox and go back to References to ensure that the annotated PDF updates

I’d often get sick of this convoluted process and just find the PDF on my computer and AirDrop it to my iPad directly, completely circumventing Dropbox. I’d then AirDrop it back to my computer and attach the marked up PDF to the reference in BibDesk. It’s inconvenient, but less inconvenient than bouncing around a bunch of different apps and hoping everything works.

Problem 2: Collaboration across many projects with many coauthors

Collaboration with a single huge references.bib file is impossible. I could share my Dropbox folder with coauthors, but then they’d see all my entries and have access to all my annotated PDFs, which seems like overkill. As I started working with coauthors, I decided to make smaller project-specific .bib files that would be shareable and editable.

This is great for project modularity—see how this bibliography.bib file only contains things we cited? But it caused major synchronization problems. If a coauthor or I make any edits to the project-specific files (adding a DOI to an existing entry, adding a new entry, etc.), those changes don’t show up in my big master references.bib file. I have to remember to copy those changes to the main file, and I never remember. With some recent projects, I’ve actually been copying entries from previous projects’ .bib files rather than from the big references.bib file. Everything’s diverging and it’s a pain.

Problem 3: BibTeX was designed for LaTeX—but just LaTeX

BibTeX works great with LaTeX. That’s why it was invented in the first place! The fact that things like pandoc work with it is partially a historical accident—.bib files were a convenient and widely used plain text bibliography format, so pandoc and MultiMarkdown used BibTeX for citations.

But citations are often more complicated than BibTeX can handle. Consider the LaTeX package biblatex-chicago—in order to be fully compliant with all the intricacies of the Chicago Manual of Style, it has to expand the BibTeX (technically BibLaTeX) format to include fields like entrysubtype for distinguishing between magazine/newspaper articles and journal articles, among dozens of other customizations and tweaks. BibTeX has a limited set of entry types, and anything that’s not one of those types gets shoehorned into the misc type.
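For instance, a magazine piece in biblatex-chicago gets flagged with that extra entrysubtype field. Here is a hypothetical entry for illustration (not a real reference):

```bibtex
@article{Doe:2020,
  author       = {Jane Doe},
  title        = {Some Magazine Piece},
  journaltitle = {Example Magazine},
  date         = {2020-07-13},
  entrysubtype = {magazine}
}
```

Plain BibTeX styles would simply ignore journaltitle, date, and entrysubtype; those fields only mean something to BibLaTeX and biblatex-chicago.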

Internally, programs like pandoc that can read BibTeX files convert them into the standard Citation Style Language (CSL) format, which they then use to format references as Chicago, APA, MLA, or whatever. It would be great to store all my citations in a CSL-compliant format in the first place rather than in a LaTeX-only format that has to be constantly converted on the fly when generating any non-LaTeX output.

The solution: Zotero

Zotero conveniently fixes all these issues:

It has a synchronization service that works across platforms (including iOS). It can work with Dropbox too if you don’t want to be bound by their file size limit or pay for extra storage, though I ended up paying for storage to (1) support open source software and (2) not have to deal with multiple programs. I’ve been doing the BibDesk → iAnnotate → Dropbox → MacBook → AirDrop dance for too many years—I just want Zotero to handle all the syncing for me.

It’s super easy to collaborate with Zotero. You can create shared group libraries with different sets of coauthors and not worry about Dropbox synchronization issues or accidental deletion of } characters in the .bib file. For one of my reading-intensive classes, I’ve even created a shared Zotero group library that all the students can join and cite from, which is neat.

It’s also far easier to maintain a master list of references. You can create a Zotero collection for specific projects, and items can live in multiple collections. Editing an item in one collection updates that item in all other collections. Zotero treats collections like iTunes/Apple Music playlists—just like songs can belong to multiple playlists, bibliographic entries can belong to multiple collections.

Zotero follows the CSL standard that pandoc uses. It was the first program to adopt CSL (way back in 2006!). It supports all kinds of entry types and fields, beyond what BibTeX supports.

Preparing for the migration

Migrating my big references.bib file to Zotero was a relatively straightforward process, but it required a few minor shenanigans to get everything working right.

Make a backup

Preparing everything for migration meant I had to make a ton of edits to the original references.bib file, so I made a copy of it first and worked with the copy.

Install extensions

To make Zotero work nicely with a pandoc-centric writing workflow, and to make file management and tag management easier, I installed these three extensions:

BibDesk allows you to add a couple extra metadata fields to entries for ratings and to mark them as read. I’ve used these fields for years and find them super useful for keeping track of how much I like articles and for remembering which ones I’ve actually finished.

Internally, BibDesk stores this data as entries in the raw BibTeX:

These fields are preserved and transferred to Zotero when you import the file, but they show up in the “Extra” field and aren’t easily filterable or sortable there:

I decided to treat these as Zotero tags, which BibDesk calls keywords. I considered making some sort of programmatic solution and writing a script to convert all the rating and read fields to keywords, but that seemed like too much work—many entries have existing keywords and parsing the file and concatenating ratings and read status to the list of keywords would be hard.

So instead I sorted all my entries in BibDesk by rating, selected all the 5-star ones and added a zzzzz tag, selected all the 4-star ones and added a zzzz tag, and so on (so that 1-star entries got a z tag). I then sorted the entries by read status and assigned xxx to all the ones I’ve read. These tag names were just temporary—in Zotero I later changed them to emojis (⭐️⭐️⭐️ and ✅)—but because I was worried about transferring complex Unicode characters like emojis across programs, I simplified things by temporarily sticking to ASCII characters.

Files

A note on BibDesk’s stored filename

BibDesk can autofile attached PDFs and manage their location. To keep track of where the files are, it stores each path as a base64-encoded bdsk-file-N field in the .bib file, like this:

@article{HeissKelley:2017,
	author = {Andrew Heiss and Judith G. Kelley},
	doi = {10.1086/691218},
	journal = {Journal of Politics},
	month = {4},
	number = {2},
	pages = {732--41},
	title = {Between a Rock and a Hard Place: International {NGOs} and the Dual Pressures of Donors and Host Governments},
	volume = {79},
	year = {2017},
	bdsk-file-1 = {YnBsaXN0MDDSAQIDBFxyZWxhdGl2ZVBhdGhZYWxpYXNEYXRhXxBcUGFwZXJzL0hlaXNzS2VsbGV5MjAxNyAtIEJldHdlZW4gYSBSb2NrIGFuZCBhIEhhcmQgUGxhY2UgSW50ZXJuYXRpb25hbCBOR09zIGFuZCB0aGUgRHVhbC5wZGZPEQJ8AAAAAAJ8AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAADfgQ51QkQAAf////8fSGVpc3NLZWxsZXkyMDE3IC0gI0ZGRkZGRkZGLnBkZgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/////9T5sk0AAAAAAAAAAAABAAMAAAogY3UAAAAAAAAAAAAAAAAABlBhcGVycwACAHwvOlVzZXJzOmFuZHJldzpEcm9wYm94OlJlYWRpbmdzOlBhcGVyczpIZWlzc0tlbGxleTIwMTcgLSBCZXR3ZWVuIGEgUm9jayBhbmQgYSBIYXJkIFBsYWNlIEludGVybmF0aW9uYWwgTkdPcyBhbmQgdGhlIER1YWwucGRmAA4ArABVAEgAZQBpAHMAcwBLAGUAbABsAGUAeQAyADAAMQA3ACAALQAgAEIAZQB0AHcAZQBlAG4AIABhACAAUgBvAGMAawAgAGEAbgBkACAAYQAgAEgAYQByAGQAIABQAGwAYQBjAGUAIABJAG4AdABlAHIAbgBhAHQAaQBvAG4AYQBsACAATgBHAE8AcwAgAGEAbgBkACAAdABoAGUAIABEAHUAYQBsAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgB6VXNlcnMvYW5kcmV3L0Ryb3Bib3gvUmVhZGluZ3MvUGFwZXJzL0hlaXNzS2VsbGV5MjAxNyAtIEJldHdlZW4gYSBSb2NrIGFuZCBhIEhhcmQgUGxhY2UgSW50ZXJuYXRpb25hbCBOR09zIGFuZCB0aGUgRHVhbC5wZGYAEwABLwAAFQACAA3//wAAAAgADQAaACQAgwAAAAAAAAIBAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAMD}}

Zotero doesn’t parse that gnarly field—it needs a field named file—and it doesn’t decode that messy string into a plain text file path, so the attached PDF won’t get imported correctly.

However, thanks to Emiliano Heyns, the Better BibTeX add-on will automatically convert these base64-encoded paths to plain text fields that Zotero can work with just fine. All PDFs will import automatically!

Customizing Zotero’s renaming rules

I wanted all the PDFs that Zotero would manage to have nice predictable filenames. In BibDesk, I used this pattern:

citekey - First few words of title.pdf

That’s been fine, but it uses spaces in the file name and doesn’t remove any punctuation or special characters, so it was a little trickier to work with in the terminal or with scripts or for easy consistent searching (especially when searching in the iPad Dropbox app when looking for a PDF to read). But because I set up that pattern in 2008, path dependency kind of locked me in and I’ve been unwilling to change it since.

Since I’m starting with a whole new reference manager, I figured it was time to adopt a better PDF naming system. In the ZotFile preferences, I set this pattern:

…with - separating the three logical units (authors, year, title), and _ separating all the words within each unit (which follows Jenny Bryan’s principles of file naming). In practice, the pattern looks like this:

…but Zotero and/or ZotFile seems to hardwire _ as the space replacement in its titles. Oh well.
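As a rough illustration, the naming scheme described above—units joined with -, words within a unit joined with _—could be sketched in R like this. Note that zotfile_style_name() is a hypothetical helper for illustration, not ZotFile’s actual renaming code:

```r
# Hypothetical sketch of the naming pattern, not ZotFile's implementation:
# "-" separates the units (authors, year, title); "_" separates words in a unit
zotfile_style_name <- function(authors, year, title, max_words = 5) {
  auth <- paste(tolower(authors), collapse = "_")
  words <- strsplit(gsub("[^a-z0-9 ]", "", tolower(title)), " ")[[1]]
  words <- words[words != ""]
  words <- words[seq_len(min(max_words, length(words)))]
  paste0(auth, "-", year, "-", paste(words, collapse = "_"), ".pdf")
}

zotfile_style_name(c("Heiss", "Kelley"), 2017,
                   "Between a Rock and a Hard Place")
#> [1] "heiss_kelley-2017-between_a_rock_and_a.pdf"
```

The real ZotFile pattern language handles things like "et al." truncation and title-word limits in its preferences; this sketch just shows the separator logic.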

Citekeys

In BibDesk, I’ve had a citation key pattern that I’ve used for years: Lastname:Year, with up to three last names for coauthored things, and an incremental lowercase letter in the case of duplicates:

Zotero and Better BibTeX preserve citekeys when you import a .bib file, but I wanted to make sure I keep using this system for new items I add going forward, so I changed the Better BibTeX preferences to use the same pattern:

auth(0,1) + auth(0,2) + auth(0,3) + ":" + year
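Conceptually, that pattern concatenates up to three full last names, then a colon and the year. Here’s a hypothetical R sketch of the scheme (it skips Better BibTeX’s incremental-letter disambiguation for duplicates):

```r
# Hypothetical sketch of the Lastname:Year citekey scheme,
# mirroring auth(0,1) + auth(0,2) + auth(0,3) + ":" + year
make_citekey <- function(last_names, year) {
  paste0(paste(head(last_names, 3), collapse = ""), ":", year)
}

make_citekey("Heiss", 2017)
#> [1] "Heiss:2017"
make_citekey(c("Heiss", "Kelley"), 2017)
#> [1] "HeissKelley:2017"
```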

Post-import tweaks

With all that initial prep work done, I imported the .bib file into my Zotero library (File > Import…). I made sure “Place imported collections and items to new collection” was checked and that files were copied to the Zotero storage folder:

Ratings and read status

The Tags panel in Zotero then showed all the project/class-specific keywords from BibDesk, in addition to the ratings and read status tags I added previously:

I renamed each of the zzz* rating tags to use emoji stars and renamed the xxx read tag to use ✅.

Zotero has the ability to assign tags specific colors and pin them in a specific order, which also makes the tags display in the main Zotero library list. Following advice from the Zotero Tag extension, I pinned the read status ✅ tag as the first tag, the 5-star rating as the second tag, the 4-star rating as the third tag, and so on.

Now the read status and ratings tags are easily accessible and appear directly in the main Zotero library list!

Tags to collections

Zotero has two different methods for categorizing entries—tags and collections—while BibDesk / BibTeX only uses keywords, which Zotero treats as tags.

I decided that in Zotero I’d use both tags and collections. Tags are reserved for things like general topics, ratings, to-read designations, etc., while collections represent specific projects or classes.

I already assigned project- and class-specific keywords in BibDesk, so I just needed to move those keyworded entries into Zotero collections. There’s no way (that I could find) to include collection information in the .bib file and have it import into Zotero, so I ended up manually creating collections for each of the imported keywords. I filtered the library to only show items from one of the future collections, selected all the items, right-clicked, and chose “Add to collection” > “New collection…” and created a new collection. I then deleted the tag.

For instance, here’s what Zotero looked like after I assigned these 6 items, tagged as “Polsci 733”, to the new “Polsci 733” collection (shown in the folder in the sidebar). I just had to delete the tag after:

incollection / inbook and crossref

Tip

This used to cause problems with child references not importing fields from their parents, but thanks to Emiliano Heyns, this all works flawlessly if you have version 6.7.47+ of Better BibTeX installed.

BibDesk natively supports the crossref field, which biber and biblatex use when working with LaTeX. This field lets you set up child/parent relationships with items, where children inherit fields from their parents. For instance, consider these two items—an edited book with lots of chapters from different authors and a chapter from that book:

@inbook{El-HusseiniToeplerSalamon:2004,
	author = {Hashem El-Husseini and Stefan Toepler and Lester M. Salamon},
	chapter = {12},
	crossref = {SalamonSokolowski:2004},
	pages = {227--32},
	title = {Lebanon}}

@book{SalamonSokolowski:2004,
	address = {Bloomfield, CT},
	editor = {Lester M. Salamon and S. Wojciech Sokolowski},
	publisher = {Kumarian Press},
	title = {Global Civil Society: Dimensions of the Nonprofit Sector},
	volume = {2},
	year = {2004}}

In BibDesk, the chapter displays like this:

Fields like book title, publisher, year, etc., are all greyed out because they’re inherited from the parent book, with the citekey SalamonSokolowski:2004

If you install version 6.7.47+ of the Better BibTeX add-on, the chapter will inherit all the information from its parent book—the book title, date, publisher, etc., will all be imported correctly:

All done!

And with that, I have a complete version of my 15-year-old references.bib file inside Zotero!

Example workflow with Quarto / R Markdown / pandoc

Part of the reason I’ve been hesitant to switch away from BibDesk for so long is because I couldn’t figure out a way to connect a Markdown document to my Zotero database. With documents that get parsed through pandoc (like R Markdown or Quarto), you add a line in the YAML front matter to specify what file contains your references:

Since Zotero keeps everything in one big database, I didn’t see a way to add something like bibliography: My Zotero Database to the YAML front matter—pandoc requires that you point to a plain text file like .bib or .json or .yml, not a Zotero database.

However, the magical Better BibTeX add-on clarified everything for me and makes it super easy to point pandoc at a single file that contains a collection of reference items.

Export collection to .bib file

First, create a collection of items that you want to cite in your writing project. Since collections are like playlists and items can belong to multiple collections, there’s no need to manage duplicate entries or anything (like I was running into with Problem 2 above).

Right click on the collection name and choose “Export collection…”.

Change the format to “Better BibLaTeX”, check “Keep updated”, and choose a place to save the resulting .bib file.

Tip

You could also export it as “Better CSL JSON” or “Better CSL YAML”, which would create a .json or .yml file that you could then point to in your YAML front matter, which would keep everything in CSL format instead of converting things to .bib and back again (see Problem 3 above). However, in my academic writing projects I still like to let LaTeX, BibLaTeX, and biber handle the citation generation instead of pandoc for PDFs, so I still rely on .bib files. But if you’re not converting to PDF, or if you’re letting the CSL style template handle the citations instead of BibLaTeX, you should probably keep everything as JSON or YAML instead of .bib.

The “Keep updated” option is the magical part of this whole thing. If you add an item or edit an existing item in the collection in Zotero, Better BibTeX will automatically re-export the collection to the .bib file. You can have one central repository of citations and lots of dynamically updated plain text .bib files that you don’t have to edit or keep track of. Truly magical.

Point the .qmd / .Rmd / .md to the exported file

You’ll now have a .bib file that contains all the references that you can cite. Put that filename in your front matter (use .json or .yml if you export the file as JSON or YAML instead):
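For example, the front matter might look like this (the file names here are just placeholders):

```yaml
---
title: "Some working paper"
bibliography: references.bib  # or references.json / references.yml
---
```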

Because the front matter is pointed at a plain text .bib file that contains all the bibliographic references, it’ll generate the citations correctly. And because Better BibTeX is configured to automatically update the exported plain text file, any changes you make in Zotero will automatically be reflected. Again, this is magic.

RStudio-based alternative

Alternatively, if you write in RStudio, you can connect RStudio to your Zotero database and have it do a similar auto-export thing. You can also tell it to use Better BibTeX to keep things automatically synced:

One extra nice thing about using RStudio is its fancy Insert Citation dialog, which makes adding citations in Markdown just like adding citations in Word or Google Docs. It only works in the Visual Markdown Editor, though, which I don’t normally use, so I just use Better BibTeX alone rather than RStudio’s Zotero connection when I write in RStudio.

Tags: writing, markdown, citations, pandoc, zotero
https://www.andrewheiss.com/blog/2023/01/08/bibdesk-to-zotero-pandoc/index.html
Sun, 08 Jan 2023 05:00:00 GMT

How to use natural and base 10 log scales in ggplot2
Andrew Heiss
https://www.andrewheiss.com/blog/2022/12/08/log10-natural-log-scales-ggplot/index.html
I always forget how to deal with logged values in ggplot—particularly things that use the natural log. The {scales} package was invented in part to let users adjust axes and scales in plots, including axes for logged values, but there have been some new developments in {scales} that have made existing answers (like this one on Stack Overflow) somewhat obsolete (e.g. the trans_breaks() and trans_format() functions used there are superseded and deprecated).

So here’s a quick overview of how to use 2022-era {scales} to adjust axis breaks and labels to use both base 10 logs and natural logs. I’ll use data from the Gapminder project, since it has a nice exponentially-distributed measure of GDP per capita.

The distribution of GDP per capita is heavily skewed, with most countries reporting less than $10,000. As a result, the scatterplot makes an upside-down L shape. Try sticking a regression line on that and you’ll get in trouble.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita",
       subtitle = "Original non-logged values")

Log base 10

ggplot comes with a built-in scale_x_log10() to transform the x-axis into logged values. It will automatically create pretty, logical breaks based on the data. Here, the breaks automatically go from 300 → 1000 → 3000 → 10000, and so on:

If we want to be mathy about the labels, we can format them as base 10 exponents using label_log():

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = label_log(digits = 2)) +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10() with exponentiated labels") +
  theme(panel.grid.minor = element_blank())

What if we don’t want the default 300, 1000, 3000, etc. breaks? In the interactive plot at gapminder.org, the breaks start at 500 and double after that: 500, 1000, 2000, 4000, 8000, etc. We can control our axis breaks by feeding a list of numbers to scale_x_log10() with the breaks argument. Instead of typing out every possible break, we can generate a list of numbers starting at 500 and then doubling (\(500 \times 2^0\), \(500 \times 2^1\), \(500 \times 2^2\), and so on):
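A minimal base R sketch of those doubling breaks (the plot itself would pass this vector to scale_x_log10()’s breaks argument):

```r
# Breaks that start at 500 and then double: 500 * 2^0, 500 * 2^1, 500 * 2^2, ...
breaks_doubling <- 500 * 2^(0:7)
breaks_doubling
#> [1]   500  1000  2000  4000  8000 16000 32000 64000
```

Passing breaks = breaks_doubling (plus something like label_comma() or label_dollar() for the labels) to scale_x_log10() reproduces the gapminder.org-style axis.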

Log base 10 makes sense for visualizing things. Seeing the jumps from $500 → $1000 → $2000 is generally easy for people to understand (especially in today’s world of exponentially growing COVID cases). When working with logged values for statistical modeling, though, analysts prefer to use the natural log, or log base \(e\), instead.

What the heck is \(e\)?

Here are a bunch of helpful resources explaining what \(e\) and the natural log are and why analysts use them all the time:

The default logging function in R, log(), calculates the natural log (you have to use log10() or log(base = 10) to get base 10 logs).
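A quick base R check of these defaults:

```r
# log() is base e by default; log10() and log(x, base = 10) are base 10
log(exp(1))           # natural log of e
#> [1] 1
log10(1000)           # base 10
#> [1] 3
log(1000, base = 10)  # same thing
#> [1] 3
all.equal(exp(log(12345)), 12345)  # exp() inverts log()
#> [1] TRUE
```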

Plotting natural logged values is a little trickier than base 10 values, since ggplot doesn’t have anything like scale_x_log_e(). But it’s still doable.

First, we can log the value on our own and just use the default scale_x_continuous() for labeling:

ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually")

Those 6, 7, 8, etc. breaks in the x-axis represent the power \(e\) is raised to, like \(e^6 \approx 403\) and \(e^7 \approx 1097\). We can format these labels as exponents to make that clearer:

ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(labels = label_math(e^.x)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually, exponentiated labels")

To get these labels, we have to pre-log GDP per capita. We didn’t need to pre-log the variable when using scale_x_log10(), since that logs things for us. We can have the scale_x_*() function handle the natural logging for us too by specifying trans = log_trans():

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans()) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans()")

Everything is logged as expected, but those labels are gross—they’re \(e^7\), \(e^9\), and \(e^{11}\), but on the dollar scale:

We can format these breaks as \(e\)-based exponents instead with label_math() (with the format = log argument to make the formatting function log the values first):

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     # This breaks_log() thing happens behind the scenes and
                     # isn't strictly necessary here
                     # breaks = breaks_log(base = exp(1)),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels")

If we want more breaks than 7, 9, 11, we can feed the scaling function a list of exponentiated breaks:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     breaks = exp(6:11),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels, custom breaks")

Tags: r, tidyverse, ggplot, data visualization
https://www.andrewheiss.com/blog/2022/12/08/log10-natural-log-scales-ggplot/index.html
Thu, 08 Dec 2022 05:00:00 GMT

Marginal and conditional effects for GLMMs with {marginaleffects}
Andrew Heiss
https://www.andrewheiss.com/blog/2022/11/29/conditional-marginal-marginaleffects/index.html
As a field, statistics is really bad at naming things.

Take, for instance, the term “fixed effects.” In econometrics and other social science-flavored statistics, this typically refers to categorical terms in a regression model. Like, if we run a model like this with gapminder data…

library(gapminder)
some_model <- lm(lifeExp ~ gdpPercap + country, data = gapminder)

…we can say that we’ve added “country fixed effects.”

That’s all fine and good until we come to the world of hierarchical or multilevel models, which has its own issues with nomenclature and can’t decide what to even call itself:

If we fit a model like this with country-based offsets to the intercept…

library(lme4)
some_multilevel_model <- lmer(lifeExp ~ gdpPercap + (1 | country), data = gapminder)

…then we get to say that there are “country random effects” or “country group effects”, while gdpPercap is actually a “fixed effect” or “population-level effect.”

“Fixed effects” in multilevel models aren’t at all the same as “fixed effects” in econometrics-land.

Wild.

Another confusing term is the idea of “marginal effects.” One common definition of marginal effects is that they are slopes, or as the {marginaleffects} vignette says…

…partial derivatives of the regression equation with respect to each variable in the model for each unit in the data.

There’s a whole R package ({marginaleffects}) dedicated to calculating these, and I have a whole big long guide about this. Basically marginal effects are the change in the outcome in a regression model when you move one of the explanatory variables up a little while holding all other covariates constant.
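To make that slope-flavored definition concrete, here’s a minimal base R sketch that computes average marginal effects numerically by hand—simulated logistic-regression data, not the gapminder model, and not {marginaleffects} itself:

```r
# Slope-flavored marginal effects by hand: nudge x a tiny bit and see how
# the predicted probability changes for each observation
set.seed(42)
d <- data.frame(x = runif(500, 0, 10))
d$y <- rbinom(500, 1, plogis(-2 + 0.5 * d$x))
m <- glm(y ~ x, data = d, family = binomial)

h <- 1e-6
d_plus <- d
d_plus$x <- d$x + h

# Unit-level slopes: numerical partial derivative of P(y = 1) with respect to x
slopes <- (predict(m, d_plus, type = "response") -
           predict(m, d, type = "response")) / h

mean(slopes)  # averaging the unit-level slopes gives the AME
```

For a logistic model, each unit-level slope is analytically \(\beta_1 \, p_i (1 - p_i)\), so the numerical version above should match the closed form almost exactly.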

But there’s also another definition (seemingly?) unrelated to the idea of partial derivatives or slopes! And once again, it’s a key part of the multilevel model world. I’ve run into it many times when reading about multilevel models (and I’ve even kind of alluded to it in past blog posts like this), but I’ve never fully understood what multilevel marginal effects are and how they’re different from slope-based marginal effects.

In multilevel models, you can calculate both marginal effects and conditional effects. Neither are necessarily related to slopes (though they both can be). They’re often mixed up. Even {brms} used to have a function named marginal_effects() that they’ve renamed to conditional_effects().

I’m not alone in my inability to remember the difference between marginal and conditional effects in multilevel models, it seems. Everyone mixes these up. TJ Mahr recently tweeted about the confusion:

TJ studies language development in children and often works with data with repeated child subjects. His typical models might look something like this, with observations grouped by child:

His data has child-based clusters, since individual children have repeated observations over time. We can find two different kinds of effects given this type of multilevel model: we can look at the effect of x1 or x2 in one typical child, or we can look at the effect of x1 or x2 across all children on average. The confusingly-named terms “conditional effect” and “marginal effect” refer to each of these “flavors” of effect:

Conditional effect = average child

Marginal effect = children on average

If we have country random effects like (1 | country) like I do in my own work, we can calculate the same two kinds of effects. Imagine a multilevel model like this:

library(lme4)
some_multilevel_model <- lmer(lifeExp ~ gdpPercap + (1 | country), data = gapminder)

Or more formally,

$$
\begin{aligned}
\text{lifeExp}_{ij} &\sim \mathcal{N}(\mu_{ij}, \sigma_y) \\
\mu_{ij} &= (\beta_0 + b_{0_j}) + \beta_1\, \text{gdpPercap}_{ij} \\
b_{0_j} &\sim \mathcal{N}(0, \sigma_0)
\end{aligned}
$$

With this model, we can look at two different types of effects:

Conditional effect = effect of gdpPercap (\(\beta_1\)) in an average or typical country (where the random country offset is 0)

Marginal effect = average effect of gdpPercap (\(\beta_1\) again) across all countries (where the random country offset is dealt with… somehow…)

This conditional vs. marginal distinction applies to any sort of hierarchical structure in multilevel models:

Conditional effect = group-specific, subject-specific, cluster-specific, country-specific effect. We set all group-specific random offsets to 0 to find the effect for a typical group / subject / student / child / cluster / country / whatever.

Marginal effect = global population-level average effect, or global effect, where group-specific differences are averaged out or integrated out or held constant.

Calculating these different effects can be tricky, even with OLS-like normal or Gaussian regression, and interpreting them can get extra complicated with generalized linear mixed models (GLMMs) where we use links like Poisson, negative binomial, logistic, or lognormal families. The math with GLMMs gets complicated—particularly with lognormal models. Kristoffer Magnusson has several incredible blog posts that explore the exact math behind each of these effects in a lognormal GLMM.
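Before recreating Magnusson’s analysis, here’s a small simulation sketch (with made-up parameters, not fitted estimates) showing why conditional and marginal means diverge in a lognormal GLMM: the conditional mean sets the cluster offset to 0, while the marginal mean averages over the whole distribution of offsets:

```r
# Conditional vs. marginal means in a lognormal GLMM, with made-up parameters
set.seed(1234)
b0   <- log(500)  # log-scale intercept
sd_u <- 0.5       # SD of cluster offsets
sd_y <- 0.5       # residual SD on the log scale

# Conditional: a typical cluster, where the random offset u0 = 0
cond_mean <- exp(b0 + sd_y^2 / 2)

# Marginal: average over the whole distribution of cluster offsets
u0 <- rnorm(1e6, sd = sd_u)
marg_mean_sim <- mean(exp(b0 + u0 + sd_y^2 / 2))

# Closed form: integrating out a normal offset adds another sd_u^2 / 2
marg_mean_closed <- exp(b0 + sd_u^2 / 2 + sd_y^2 / 2)

c(conditional = cond_mean, marginal = marg_mean_closed)
```

The marginal mean is larger because exponentiation is convex: by Jensen’s inequality, averaging exp(offset) over clusters pulls the mean up by a factor of \(\exp(\sigma_0^2 / 2)\), even though the typical cluster’s offset is 0.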

Vincent Arel-Bundock’s magisterial {marginaleffects} R package can calculate both conditional and marginal effects automatically. I accidentally stumbled across the idea of multilevel marginal and conditional effects in an earlier blog post, but there I did everything with {emmeans} rather than {marginaleffects}, and as I explore here, {marginaleffects} is great for calculating average marginal effects (AMEs) rather than marginal effects at the mean (MEMs). Also in that earlier guide, I don’t really use this “conditional” vs. “marginal” distinction and just end up calling everything marginal. So everything here is more in line with the seemingly standard multilevel model ideas of “conditional” and “marginal” effects.

Let’s load some libraries, use some neat colors and a nice ggplot theme, and get started.

Magnusson’s data and model: the effect of a treatment on gambling losses

To make sure I’ve translated Magnusson’s math into the corresponding (and correct) {marginaleffects} syntax, I recreate his analysis here. He imagines some sort of intervention or treatment that is designed to reduce the amount of dollars lost in gambling each week (\(y\)). The individuals in this situation are grouped into some sort of clusters—perhaps neighborhoods, states, or countries, or even the same individuals over time if we have repeated longitudinal observations. The exact kind of cluster doesn’t matter here—all that matters is that observations are nested in groups, and those groups have their own specific characteristics that influence individual-level outcomes. In this simulated data, there are 20 clusters, with 30 individuals in each cluster, with 600 total observations.

To be more formal about the structure, we can say that every outcome gets two subscripts for the cluster (\(j\)) and the person inside each cluster (\(i\)). We thus have \(y_{ij}\), where \(j \in \{1, \dots, 20\}\) and \(i \in \{1, \dots, 30\}\). The nested, hierarchical, multilevel nature of the data makes the structure look something like this:

Kristoffer Magnusson’s original data generation code

#' Generate lognormal data with a random intercept
#'
#' @param n1 patients per cluster
#' @param n2 clusters per treatment
#' @param B0 log intercept
#' @param B1 log treatment effect
#' @param sd_log log sd
#' @param u0 SD of log intercepts (random intercept)
#'
#' @return a data.frame
gen_data <- function(n1, n2, B0, B1, sd_log, u0) {
  cluster <- rep(1:(2 * n2), each = n1)
  TX <- rep(c(0, 1), each = n1 * n2)
  u0 <- rnorm(2 * n2, sd = u0)[cluster]
  mulog <- (B0 + B1 * TX + u0)
  y <- rlnorm(2 * n1 * n2, meanlog = mulog, sdlog = sd_log)
  d <- data.frame(cluster, TX, y)
  d
}

set.seed(4445)
pars <- list("n1" = 30,  # observations per cluster
             "n2" = 10,  # clusters per treatment
             "B0" = log(500),
             "B1" = log(0.5),
             "sd_log" = 0.5,
             "u0" = 0.5)
d <- do.call(gen_data, pars)

The model of the effect of \(\text{TX}\) on gambling losses for individuals nested in clusters can be written formally like this, with cluster \(j\)-specific offsets to the intercept term (i.e. \(u_{0_j}\), or cluster random effects):

$$
\begin{aligned}
y_{ij} &\sim \operatorname{LogNormal}(\mu_{ij}, \sigma_y) \\
\mu_{ij} &= \beta_0 + \beta_1\, \text{TX}_j + u_{0_j} \\
u_{0_j} &\sim \mathcal{N}(0, \sigma_0)
\end{aligned}
$$

We can fit this model with {brms} (or lme4::lmer() if you don’t want to be Bayesian):

fit <- brm(
  bf(y ~ 1 + TX + (1 | cluster)),
  family = lognormal(),
  data = d,
  chains = 4, iter = 5000, warmup = 1000,
  seed = 4445
)

fit
##  Family: lognormal 
##   Links: mu = identity; sigma = identity 
## Formula: y ~ 1 + TX + (1 | cluster) 
##    Data: dat (Number of observations: 600) 
##   Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
##          total post-warmup draws = 16000
## 
## Group-Level Effects: 
## ~cluster (Number of levels: 20) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.63      0.12     0.45     0.92 1.00     2024     3522
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     6.21      0.20     5.81     6.62 1.00     2052     3057
## TX           -0.70      0.29    -1.28    -0.13 1.00     2014     2843
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.51      0.01     0.48     0.54 1.00     7316     8256
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

There are four parameters that we care about in that huge wall of text. We’ll pull them out as standalone objects (using TJ Mahr’s neat model-to-list trick) and show them in a table so we can keep track of everything easier.

fit %>%
  tidy() %>%
  mutate(Parameter = c("\\(\\beta_0\\)", "\\(\\beta_1\\)",
                       "\\(\\sigma_0\\)", "\\(\\sigma_y\\)")) %>%
  mutate(Description = c(
    "Global average gambling losses across all individuals",
    "Effect of treatment on gambling losses for all individuals",
    "Between-cluster variability of average gambling losses",
    "Within-cluster variability of gambling losses"
  )) %>%
  mutate(term = glue::glue("<code>{term}</code>"),
         estimate = round(estimate, 3)) %>%
  select(Parameter, Term = term, Description, Estimate = estimate) %>%
  kbl(escape = FALSE) %>%
  kable_styling(full_width = FALSE)

| Parameter | Term | Description | Estimate |
|---|---|---|---|
| \(\beta_0\) | `(Intercept)` | Global average gambling losses across all individuals | 6.210 |
| \(\beta_1\) | `TX` | Effect of treatment on gambling losses for all individuals | -0.702 |
| \(\sigma_0\) | `sd__(Intercept)` | Between-cluster variability of average gambling losses | 0.635 |
| \(\sigma_y\) | `sd__Observation` | Within-cluster variability of gambling losses | 0.507 |

There are a few problems with these estimates though: (1) they’re on the log scale, which isn’t very interpretable, and (2) neither the intercept term nor the TX term incorporates any details about the cluster-level effects beyond the extra information we get through partial pooling. So our goal here is to transform these estimates into something interpretable that also incorporates group-level information.

Conditional effects, or effect of a variable in an average cluster

Conditional effects

Conditional effects = average or typical cluster; random offsets set to 0

Conditional effects refer to the effect of a variable in a typical group—country, cluster, school, subject, or whatever else is in the (1 | group) term in the model. “Typical” here means that the random offset is set to zero, or that there are no random effects involved.

Average outcomes for a typical cluster

The average outcome across the possible values of TX for a typical cluster is formally defined as

Exactly how you calculate this mathematically depends on the distribution family. For a lognormal distribution, it is this:

We can calculate this automatically with marginaleffects::predictions() by setting re_formula = NA to ignore all random effects, or to set all the random offsets to zero:
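Before turning to {marginaleffects}, it’s worth checking the lognormal conditional mean by hand. The sketch below is in Python (standing in for the post’s R code) and plugs the rounded point estimates from the table above (β0 = 6.210, β1 = −0.702, σy = 0.507) into exp(β0 + β1 · TX + σy² / 2). Because it uses point estimates instead of full posterior draws, the numbers only approximate the full posterior summaries.

```python
import math

# Rounded posterior point estimates from the model table
B0 = 6.210       # intercept, log scale
B1 = -0.702      # treatment effect, log scale
sigma_y = 0.507  # within-cluster sd

def conditional_mean(tx):
    """Mean losses in a typical cluster (offset b0_j = 0):
    E[y | TX, b0_j = 0] = exp(B0 + B1*TX + sigma_y^2 / 2)."""
    return math.exp(B0 + B1 * tx + sigma_y**2 / 2)

mean_control = conditional_mean(0)
mean_treated = conditional_mean(1)
conditional_ate = mean_treated - mean_control

print(round(mean_control, 1))     # typical-cluster losses, untreated
print(round(mean_treated, 1))     # typical-cluster losses, treated
print(round(conditional_ate, 1))  # conditional ATE
```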

Because we’re working with Bayesian posteriors, we might as well do neat stuff with them instead of just collapsing them down to single-number point estimates. The posteriordraws() function in {marginaleffects} lets us extract the modified/calculated MCMC draws, and then we can plot them with {tidybayes} / {ggdist}:

p_conditional_ate <- conditional_ate %>% 
  ggplot(aes(x = draw)) +
  stat_halfeye(fill = clrs[3]) +
  scale_x_continuous(labels = label_dollar(style_negative = "minus")) +
  labs(x = "(TX = 1) − (TX = 0)", y = "Density",
       title = "Conditional cluster-specific ATE",
       subtitle = "Typical cluster where *b*<sub>0<sub>j</sub></sub> = 0") +
  coord_cartesian(xlim = c(-900, 300)) +
  theme_nice() +
  theme(plot.subtitle = element_markdown())

p_conditional_ate

Marginal effects, or effect of a variable across clusters on average

Marginal effects

Marginal effects = global/population-level effect; clusters on average; random offsets are incorporated into the estimate

Marginal effects refer to the global- or population-level effect of a variable. In multilevel models, coefficients can have random group-specific offsets to a global mean. That’s what the \(b_{0_j}\) offset in \(\beta_0 + b_{0_j}\) is in the formal model we defined earlier:

By definition, these offsets are distributed normally with a mean of 0 and a standard deviation of \(\sigma_0\), or sd__(Intercept) in {brms} output. We can visualize these cluster-specific offsets to get a better feel for how they work:

The intercept for Cluster 1 here is basically the same as the global \(\beta_0\) coefficient; Cluster 19 has a big positive offset, while Cluster 11 has a big negative offset.

The model parameters show the whole range of possible cluster-specific intercepts, or \(\operatorname{Normal}(\beta_0, \sigma_0^2)\):

ggplot() +
  stat_function(fun = ~ dnorm(., mean = B0, sd = sigma_0^2),
                geom = "area", fill = clrs[4]) +
  xlim(4, 8) +
  labs(x = "Possible cluster-specific intercepts", y = "Density",
       title = glue::glue("Normal(µ = {round(B0, 3)}, σ = {round(sigma_0, 3)}<sup>2</sup>)")) +
  theme_nice() +
  theme(plot.title = element_markdown())

When generating population-level estimates, then, we need to somehow incorporate this range of possible cluster-specific intercepts into the population-level predictions. We can do this a couple different ways: we can (1) average, marginalize or integrate across them, or (2) integrate them out.

Average population-level outcomes

The average outcome across the possible values of TX for all clusters together is formally defined as

As with the conditional effects, the equation for calculating this depends on the family you’re using. For lognormal families, it’s this incredibly scary formula:

Wild. This is a mess because it integrates over the normally-distributed cluster-specific offsets, thus incorporating them all into the overall effect.

Brute force Monte Carlo integration, where we create a bunch of hypothetical cluster offsets with a mean of 0 and a standard deviation of \(\sigma_0\), calculate the average outcome, then take the average of all those hypothetical clusters:

# A bunch of hypothetical cluster offsets
sigma_0_i <- rnorm(1e5, 0, sigma_0)

B_TXs %>% map(~{
  mean(exp(. + sigma_0_i + sigma_y^2 / 2))
})
## $`0`
## [1] 694
## 
## $`1`
## [1] 344
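The same brute-force integration works in any language. Here’s a stdlib-only Python sketch of the identical logic (hypothetical offsets drawn from Normal(0, σ0), then averaged), again using the rounded point estimates rather than the full posterior, so it lands near the 694 and 344 above rather than exactly on them:

```python
import math
import random

random.seed(4445)  # arbitrary seed, echoing the model's seed

# Rounded point estimates from the fitted model (stand-ins for posterior draws)
B0, B1 = 6.210, -0.702
sigma_0 = 0.635  # between-cluster sd
sigma_y = 0.507  # within-cluster sd

# A bunch of hypothetical cluster offsets: b0_j ~ Normal(0, sigma_0)
offsets = [random.gauss(0, sigma_0) for _ in range(100_000)]

def marginal_mean(tx):
    """Average lognormal outcome, integrating over hypothetical clusters."""
    linpred = B0 + B1 * tx
    return sum(math.exp(linpred + b + sigma_y**2 / 2) for b in offsets) / len(offsets)

print(round(marginal_mean(0)))  # near the 694 from the R version
print(round(marginal_mean(1)))  # near the 344
```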

Those approaches are all great, but the math can get really complicated if there are interaction terms or splines or if you have more complex random effects structures (random slope offsets! nested groups!)

So instead we can use {marginaleffects} to handle all that complexity for us.

Average / marginalize / integrate across existing random effects: Here we calculate predictions for TX within each of the existing clusters. We then collapse them into averages for each level of TX. The values here are not identical to what we found with the earlier approaches, though they’re in the same general area. I’m not 100% sure why—I’m guessing it’s because there aren’t a lot of clusters to work with, so the averages aren’t really stable.

p_marginal_preds <- marginal_preds %>% 
  ggplot(aes(x = draw, fill = factor(TX))) +
  stat_halfeye() +
  scale_fill_manual(values = colorspace::lighten(c(clrs[5], clrs[1]), 0.4)) +
  scale_x_continuous(labels = label_dollar()) +
  labs(x = "Gambling losses", y = "Density", fill = "TX",
       title = "Marginal population-level means",
       subtitle = "Random effects averaged / marginalized / integrated") +
  coord_cartesian(xlim = c(100, 1500)) +
  theme_nice()

p_marginal_preds

Integrate out random effects: Instead of using the existing cluster intercepts, we can integrate out the random effects by generating predictions for a bunch of clusters (like 100), and then collapse those predictions into averages. This is similar to the intuition of brute force Monte Carlo integration in approach #3 earlier. This takes a long time! It results in the same estimates we found with the mathematical approaches in #1, #2, and #3 earlier.

The average treatment effect (ATE) for a binary treatment is the difference between the two averages when TX = 1 and TX = 0, after somehow incorporating all the random cluster-specific offsets:

For a lognormal family, it’s this terrifying thing:

That looks scary, but really it’s just the difference in the two estimates we found before: \(\operatorname{E}(y \mid TX = 1)\) and \(\operatorname{E}(y \mid TX = 0)\). We can use the same approaches from above and just subtract the two estimates, like this with the magical moment-generating function thing:

Population-level ATE with moment-generating function:
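Concretely, the moment-generating function of a Normal(0, σ0) variable gives E[exp(b0j)] = exp(σ0² / 2), so the population-level mean collapses to exp(β0 + β1 · TX + σ0²/2 + σy²/2). Here’s a quick Python check of that closed form with the rounded point estimates (so the result is close to, not identical to, the posterior-based numbers):

```python
import math

B0, B1 = 6.210, -0.702           # point estimates from the model table
sigma_0, sigma_y = 0.635, 0.507  # between- and within-cluster sds

def marginal_mean(tx):
    # MGF of Normal(0, sigma_0): E[exp(b0_j)] = exp(sigma_0^2 / 2),
    # so the random offsets integrate out into a closed-form mean
    return math.exp(B0 + B1 * tx + sigma_0**2 / 2 + sigma_y**2 / 2)

marginal_ate = marginal_mean(1) - marginal_mean(0)
print(round(marginal_mean(0), 1))  # close to the Monte Carlo 694
print(round(marginal_mean(1), 1))  # close to 344
print(round(marginal_ate, 1))      # population-level ATE
```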

We can do this with {marginaleffects} too, either by averaging / marginalizing / integrating across existing clusters (though again, this weirdly gives slightly different results) or by integrating out the random effects from a bunch of hypothetical clusters (which gives the same result as the more analytical / mathematical estimates):

Average / marginalize / integrate across existing random effects:

# Marginal treatment effect (or global population level effect)
comparisons(fit, variables = "TX", re_formula = NULL) %>% 
  tidy()

## # A tibble: 1 × 6
##   type     term  contrast estimate conf.low conf.high
##   <chr>    <chr> <chr>       <dbl>    <dbl>     <dbl>
## 1 response TX    1 - 0       -326.    -652.     -60.9

Finally, we can work directly with the coefficients to get more slope-like effects, which is especially helpful when the coefficient of interest isn’t for a binary variable. Typically with GLMs with log or logit links (like logit, Poisson, negative binomial, lognormal, etc.) we can exponentiate the coefficient to get it as an odds ratio or a multiplicative effect. That works here too:

A one-unit increase in TX causes a roughly 50% decrease (exp(B1) - 1) in the outcome. Great.
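That percentage is just the exponentiated coefficient. A quick sanity check of the arithmetic (Python here, using the rounded −0.702 estimate from the table, which works out to just over a 50% decrease):

```python
import math

B1 = -0.702  # treatment coefficient, log scale (from the model table)

multiplicative_effect = math.exp(B1)   # outcome gets multiplied by this factor
percent_change = multiplicative_effect - 1

print(round(multiplicative_effect, 3))  # about 0.496
print(round(percent_change, 3))         # about -0.504, i.e. a ~50% decrease
```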

That’s all fine here because the lognormal model doesn’t have any weird nonlinearities or interactions, but in the case of logistic regression or anything with interaction terms, life gets more complicated, so it’s better to work with marginaleffects() instead of exponentiating things by hand. If we use type = "link" we’ll keep the results on the log scale, and then we can exponentiate them. All the other random effects options that we used before (re_formula = NA, re_formula = NULL, integrating effects out, and so on) work here too.

marginaleffects(fit, variable = "TX", type = "link",
                newdata = datagrid(TX = 0)) %>% 
  mutate(across(c(dydx, conf.low, conf.high), ~exp(.))) %>% 
  select(rowid, type, term, dydx, conf.low, conf.high)
##   rowid type term  dydx conf.low conf.high
## 1     1 link   TX 0.496    0.279      0.88

We can visualize the odds-ratio-scale posterior for fun:

If we use type = "response", we can get slopes at specific values of the coefficient (which is less helpful here, since TX can only be 0 or 1; but it’s useful for continuous coefficients of interest).

Summary

Phew, that was a lot. Here’s a summary table to help keep things straight.

Tags: r, tidyverse, regression, statistics, bayes, brms, lognormal
https://www.andrewheiss.com/blog/2022/11/29/conditional-marginal-marginaleffects/index.html
Tue, 29 Nov 2022 05:00:00 GMT

Visualizing the differences between Bayesian posterior predictions, linear predictions, and the expectation of posterior predictions
Andrew Heiss
https://www.andrewheiss.com/blog/2022/09/26/guide-visualizing-types-posteriors/index.html

Downloadable cheat sheets!

You can download PDF, SVG, and PNG versions of the diagrams and cheat sheets in this post, as well as the original Adobe Illustrator and InDesign files, at the bottom of this post

Something that has always plagued me about working with Bayesian posterior distributions, but that I’ve always waved off as too hard to think about, has been the differences between posterior predictions, the expectation of the posterior predictive distribution, and the posterior of the linear predictor (or posterior_predict(), posterior_epred(), and posterior_linpred() in the brms world). But reading these two books has forced me to finally figure it out.

So here’s an explanation of my mental model of the differences between these types of posterior distributions. It’s definitely not 100% correct, but it makes sense for me.

For bonus fun, skip down to the incredibly useful diagrams and cheat sheets at the bottom of this post.

Let’s load some packages, load some data, and get started!

library(tidyverse)        # ggplot, dplyr, and friends
library(patchwork)        # Combine ggplot plots
library(ggtext)           # Fancier text in ggplot plots
library(scales)           # Labeling functions
library(brms)             # Bayesian modeling through Stan
library(tidybayes)        # Manipulate Stan objects in a tidy way
library(marginaleffects)  # Calculate marginal effects
library(modelr)           # For quick model grids
library(extraDistr)       # For dprop() beta distribution with mu/phi
library(distributional)   # For plotting distributions with ggdist
library(palmerpenguins)   # Penguins!
library(kableExtra)       # For nicer tables

# Make random things reproducible
set.seed(1234)

# Bayes stuff
# Use the cmdstanr backend for Stan because it's faster and more modern than
# the default rstan. You need to install the cmdstanr package first
# (https://mc-stan.org/cmdstanr/) and then run cmdstanr::install_cmdstan() to
# install cmdstan on your computer.
options(mc.cores = 4,  # Use 4 cores
        brms.backend = "cmdstanr")
bayes_seed <- 1234

# Colors from MetBrewer
clrs <- MetBrewer::met.brewer("Java")

# Custom ggplot themes to make pretty plots
# Get Roboto Condensed at https://fonts.google.com/specimen/Roboto+Condensed
# Get Roboto Mono at https://fonts.google.com/specimen/Roboto+Mono
theme_pred <- function() {
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          plot.title = element_text(face = "bold"),
          strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          axis.title.x = element_text(hjust = 0),
          axis.title.y = element_text(hjust = 0),
          legend.title = element_text(face = "bold"))
}

theme_pred_dist <- function() {
  theme_pred() +
    theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
          plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
          axis.text.y = element_blank(),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_blank())
}

theme_pred_range <- function() {
  theme_pred() +
    theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
          plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
          panel.grid.minor.y = element_blank())
}

update_geom_defaults("text", list(family = "Roboto Condensed", lineheight = 1))

# Add a couple new variables to the penguins data:
# - is_gentoo: Indicator for whether or not the penguin is a Gentoo
# - bill_ratio: The ratio of a penguin's bill depth (height) to its bill length
penguins <- penguins |> 
  drop_na(sex) |> 
  mutate(is_gentoo = species == "Gentoo") |> 
  mutate(bill_ratio = bill_depth_mm / bill_length_mm)

Normal (Gaussian) model

First we’ll look at basic linear regression. Normal or Gaussian models are roughly equivalent to frequentist ordinary least squares (OLS) regression. We estimate an intercept and a slope and draw a line through the data. If we include multiple explanatory variables or predictors, we’ll have multiple slopes, or partial derivatives or marginal effects (see here for more about that). But to keep things as simple and basic and illustrative as possible, we’ll just use one explanatory variable here.

In this example, we’re interested in the relationship between penguin flipper length and penguin body mass. Do penguins with longer flippers weigh more? Here’s what the data looks like:

It seems like there’s a pretty clear relationship between the two. As flipper length increases, body mass also increases.

We can create a more formal model for the distribution of body mass, conditional on different values of flipper length, like this:

Or more generally:

This implies that body mass follows a normal (or Gaussian) distribution with some average (\(\mu\)) and some amount of spread (\(\sigma\)), and that the \(\mu\) parameter is conditional on (or based on, or dependent on) flipper length.
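To make the generative story concrete, here’s a tiny simulation of that model in Python. The parameter values are purely illustrative (made up for this sketch, not the fitted estimates): pick a flipper length, compute µ from the line, then draw a mass from Normal(µ, σ).

```python
import random

random.seed(1234)

# Purely illustrative parameters, NOT the fitted model's estimates
beta0 = -5800.0  # intercept (huge and negative, like the real model's)
beta1 = 50.0     # grams of body mass per mm of flipper
sigma = 400.0    # spread around the regression line

def simulate_mass(flipper_length_mm):
    """One draw from body_mass ~ Normal(beta0 + beta1 * flipper, sigma)."""
    mu = beta0 + beta1 * flipper_length_mm
    return random.gauss(mu, sigma)

masses = [simulate_mass(201) for _ in range(10_000)]
avg_mass = sum(masses) / len(masses)
print(round(avg_mass))  # hovers near mu = -5800 + 50 * 201 = 4250
```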

Let’s run that model in Stan through brms (with all the default priors; in real life you’d want to set more official priors for the intercept \(\beta_0\), the coefficient \(\beta_1\), and the overall model spread \(\sigma\)):

model_normal <- brm(
  bf(body_mass_g ~ flipper_length_mm),
  family = gaussian(),
  data = penguins
)
## Start sampling

If we look at the model results, we can see the means of the posterior distributions of each of the model’s parameters (\(\beta_0\), \(\beta_1\), and \(\sigma\)). The intercept (\(\beta_0\)) is huge and negative because flipper length is far away from 0, so it’s pretty uninterpretable. The \(\beta_1\) coefficient shows that a one-mm increase in flipper length is associated with a 50 gram increase in body mass. And the overall model standard deviation \(\sigma\) shows that there’s roughly 400 grams of deviation around the mean body mass.

That table shows just the posterior means for each of these parameters, but these are technically all complete distributions. In this post we’re not interested in these actual values—we’re concerned with the outcome, or penguin weight here. (But you can see this post or this post or this post or this documentation for more about working with these coefficients and calculating marginal effects)

Going back to the formal model, so far we’ve looked at \(\beta_0\), \(\beta_1\), and \(\sigma\), but what about \(\mu\) and the overall posterior distribution of the outcome (or \(y\))? This is where life gets a little trickier (and why this guide exists in the first place!). Both \(\mu\) and the posterior for \(y\) represent penguin body mass, but conceptually they’re different things. We’ll extract these different distributions with three different brms functions: posterior_predict(), posterior_epred(), and posterior_linpred() (the code uses predicted_draws(), epred_draws(), and linpred_draws(); these are tidybayes’s wrappers for the corresponding brms functions).

Note the newdata argument here. We have to feed a data frame of values to plug into the model to make these different posterior predictions. We could feed the original dataset with newdata = penguins, which would plug each row of the data into the model and generate 4,000 posterior draws for it. Given that there are 333 rows in the penguins data, using newdata = penguins would give us 333 × 4,000 = 1,332,000 rows. That’s a ton of data, and looking at it all together like that isn’t super useful unless we look at predictions across a range of possible predictors. We’ll do that later in this section and see the posterior predictions of weights across a range of flipper lengths. But here we’re just interested in the prediction of the outcome based on a single value of flipper length. We’ll use the average (200.967 mm), but it could easily be the median or whatever arbitrary number we want.

# Make a little dataset of just the average flipper length
penguins_avg_flipper <- penguins |> 
  summarize(flipper_length_mm = mean(flipper_length_mm))

# Extract different types of posteriors
normal_linpred <- model_normal |> 
  linpred_draws(newdata = penguins_avg_flipper)

normal_epred <- model_normal |> 
  epred_draws(newdata = penguins_avg_flipper)

normal_predicted <- model_normal |> 
  predicted_draws(newdata = penguins_avg_flipper,
                  seed = 12345)  # So that the manual results with rnorm() are the same later

These each show the posterior distribution of penguin weight, and each corresponds to a different part of the formal mathematical model. We can explore these nuances if we look at these distributions’ means, medians, standard deviations, and overall shapes:

Code

summary_normal_linpred <- normal_linpred |> 
  ungroup() |> 
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_epred <- normal_epred |> 
  ungroup() |> 
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_predicted <- normal_predicted |> 
  ungroup() |> 
  summarize(across(.prediction, lst(mean, sd, median), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`,
  "<code>posterior_linpred()</code>", "\\(\\mu\\) in the model",
  "<code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\mu\\) in the model",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Normal}(\\mu_i, \\sigma)\\)"
) |> 
  bind_cols(bind_rows(summary_normal_linpred, summary_normal_epred, summary_normal_predicted)) |> 
  kbl(escape = FALSE) |> 
  kable_styling()

| Function | Model element | mean | sd | median |
|---|---|---|---|---|
| `posterior_predict()` | Random draws from posterior \(\operatorname{Normal}(\mu_i, \sigma)\) | 4207 | 386.7 | 4209 |

p1 <- ggplot(normal_linpred, aes(x = .linpred)) +
  stat_halfeye(fill = clrs[3]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "posterior_linpred(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p2 <- ggplot(normal_epred, aes(x = .epred)) +
  stat_halfeye(fill = clrs[2]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *µ* in the model</span>",
       subtitle = "posterior_epred(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist()

p3 <- ggplot(normal_predicted, aes(x = .prediction)) +
  stat_halfeye(fill = clrs[1]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(2900, 5500)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "posterior_predict(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist()

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))

The most obvious difference between these different posterior predictions is the range of predictions. For posterior_linpred() and posterior_epred(), the standard error is tiny and the range of plausible predicted values is really narrow. For posterior_predict(), the standard error is substantially bigger, and the corresponding range of predicted values is huge.

To understand why, let’s explore the math going on behind the scenes in these functions. Both posterior_linpred() and posterior_epred() correspond to the \(\mu\) part of the model. They’re the average penguin weight as predicted by the linear model (hence linpred: linear predictor). We can see this if we plug a 201 mm flipper length into each row of the posterior and calculate mu by hand with b_Intercept + (b_flipper_length_mm * 201):

That mu column is identical to what we calculate with posterior_linpred(). Just to confirm, we can plot the two distributions:

p1_manual <- linpred_manual |> 
  ggplot(aes(x = mu)) +
  stat_halfeye(fill = colorspace::lighten(clrs[3], 0.5)) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "b_Intercept + (b_flipper_length_mm * 201)") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p1_manual | p1

Importantly, the distribution of the \(\mu\) part of the model here does not incorporate information about \(\sigma\). That’s why the distribution is so narrow.

The results from posterior_predict(), on the other hand, correspond to the \(y\) part of the model. Officially, they are draws from a random normal distribution using both the estimated \(\mu\) and the estimated \(\sigma\). These results contain the full uncertainty of the posterior distribution of penguin weight. To help with the intuition, we can do the same thing by hand when plugging in a 201 mm flipper length:

set.seed(12345)  # To get the same results as posterior_predict() from earlier

postpred_manual <- model_normal |> 
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |> 
  mutate(mu = b_Intercept + 
           (b_flipper_length_mm * penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

postpred_manual |> 
  select(.draw:y_new)
## # A tibble: 4,000 × 6
##    .draw b_Intercept b_flipper_length_mm sigma    mu y_new
##    <int>       <dbl>               <dbl> <dbl> <dbl> <dbl>
##  1     1      -6152.                51.5  384. 4204. 4429.
##  2     2      -5872                 50.2  401. 4221. 4506.
##  3     3      -6263.                52.1  390. 4202. 4159.
##  4     4      -6066.                51.1  409. 4213. 4027.
##  5     5      -5740.                49.4  362. 4191. 4411.
##  6     6      -5678.                49.2  393. 4213. 3499.
##  7     7      -6107.                51.1  417. 4160. 4423.
##  8     8      -5422.                48.0  351. 4235. 4138.
##  9     9      -6303.                52.1  426. 4177. 4055.
## 10    10      -6193.                51.6  426. 4184. 3793.
## # … with 3,990 more rows

That y_new column here is the \(y\) part of the model and should have a lot more uncertainty than the mu column, which is just the \(\mu\) part of the model. Notably, the y_new column is the same as what we get when using posterior_predict(). We’ll plot the two distributions to confirm:

p3_manual <- postpred_manual |> 
  ggplot(aes(x = y_new)) +
  stat_halfeye(fill = colorspace::lighten(clrs[1], 0.5)) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(2900, 5500)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "rnorm(b_Intercept + (b_flipper_length_mm * 201), sigma)") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p3_manual | p3

The results from posterior_predict() and posterior_linpred() have the same mean, but the full posterior predictions that incorporate the estimated \(\sigma\) have a much wider range of plausible values.
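That “same mean, wider spread” pattern is just the law of total variance: Var(y) = Var(µ) + σ². A small simulation (Python, with hypothetical numbers loosely in the range of this model rather than actual posterior draws) shows it:

```python
import random
import statistics

random.seed(1234)

# Hypothetical posteriors, loosely in this model's range:
# mu is estimated precisely (small sd); sigma adds a lot more noise
mu_draws = [random.gauss(4200, 30) for _ in range(10_000)]  # like posterior_linpred()
y_draws = [random.gauss(mu, 390) for mu in mu_draws]        # like posterior_predict()

mean_mu, mean_y = statistics.mean(mu_draws), statistics.mean(y_draws)
sd_mu, sd_y = statistics.stdev(mu_draws), statistics.stdev(y_draws)

print(round(mean_mu), round(mean_y))  # both centered in the same place
print(round(sd_mu), round(sd_y))      # but sd_y ~ sqrt(30^2 + 390^2), much wider
```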

The results from posterior_epred() are a little strange to understand, and in the case of normal/Gaussian regression (and many other types of regression models!), they’re identical to the linear predictor (posterior_linpred()). These are the posterior draws of the expected value or mean of the posterior distribution, or \(\operatorname{E}(y)\) in the model. Behind the scenes, this is calculated by taking the average of each row’s posterior predictive distribution and then taking the average of that.

Once again, a quick illustration can help. As before, we’ll manually plug a flipper length of 201 mm into the posterior estimates of the intercept and slope to calculate the \(\mu\) part of the model. We’ll then use that along with the estimated \(\sigma\) in rnorm() to generate the posterior predictive distribution, or the \(y\) part of the model. Finally, we’ll take the average of the y_new posterior predictive distribution to get the expectation of the posterior predictive distribution, or epred. It’s the same as what we get when using posterior_epred(); the only differences are because of randomness.

epred_manual <- model_normal |> 
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |> 
  mutate(mu = b_Intercept + 
           (b_flipper_length_mm * penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

# This is posterior_epred()
epred_manual |> 
  summarize(epred = mean(y_new))
## # A tibble: 1 × 1
##   epred
##   <dbl>
## 1 4202.

# It's essentially the same as the actual posterior_epred()
normal_epred |> 
  ungroup() |> 
  summarize(epred = mean(.epred))
## # A tibble: 1 × 1
##   epred
##   <dbl>
## 1 4206.

We can also look at these different types of posterior predictions across a range of possible flipper lengths. There’s a lot more uncertainty in the full posterior, since it incorporates the uncertainty of both \(\mu\) and \(\sigma\), while the uncertainty of the linear predictor/expected value of the posterior is much more narrow (and equivalent in this case):

p1 <- penguins |> 
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |> 
  add_linpred_draws(model_normal, ndraws = 100) |> 
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .linpred), .width = 0.95, alpha = 0.5,
                  color = clrs[3], fill = clrs[3]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "posterior_linpred()") +
  theme_pred_range()

p2 <- penguins |> 
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |> 
  add_epred_draws(model_normal, ndraws = 100) |> 
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .epred), .width = 0.95, alpha = 0.5,
                  color = clrs[2], fill = clrs[2]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *µ* in the model</span>",
       subtitle = "posterior_epred()") +
  theme_pred_range()

p3 <- penguins |> 
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |> 
  add_predicted_draws(model_normal, ndraws = 100) |> 
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .prediction), .width = 0.95, alpha = 0.5,
                  color = clrs[1], fill = clrs[1]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "posterior_predict()") +
  theme_pred_range()

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))

Phew. There are a lot of moving parts here with different types of posteriors and averages and variances. Here’s a helpful diagram that shows how everything is connected and which R functions calculate which parts:

Generalized linear models with link transformations

Generalized linear models (e.g., logistic, probit, ordered logistic, exponential, Poisson, negative binomial, etc.) use special link functions (e.g. logit, log, etc.) to transform the likelihood of an outcome into a scale that is more amenable to linear regression.

Estimates from these models can be used in their transformed scales (e.g., log odds in logistic regression) or can be back-transformed into their original scale (e.g., probabilities in logistic regression).

When working with links, the various Bayesian prediction functions return values on different scales, each corresponding to different parts of the model.

Logistic regression example

To show how different link functions work with posteriors from generalized linear models, we’ll use logistic regression with a single explanatory variable (again, for the sake of illustrative simplicity). We’re interested in whether a penguin’s bill length can predict if a penguin is a Gentoo or not. Here’s what the data looks like—Gentoos seem to have taller bills than their Chinstrap and Adélie counterparts.

ggplot(penguins, aes(x = bill_length_mm, y = as.numeric(is_gentoo))) +
  geom_dots(aes(side = ifelse(is_gentoo, "bottom", "top")),
            pch = 19, color = "grey20", scale = 0.2) +
  geom_smooth(method = "glm",
              method.args = list(family = binomial(link = "logit")),
              color = clrs[5], se = FALSE) +
  scale_y_continuous(labels = label_percent()) +
  labs(x = "Bill length (mm)", y = "Probability of being a Gentoo") +
  theme_pred()

Instead, we can transform the outcome variable from 0s and 1s into logged odds or logits, which creates a nice straight line that we can use with regular old linear regression. Again, I won’t go into the details of how logistic regression works here (see this example or this tutorial or this post or this post for lots more about it).

Just know that logits (or log odds) are a transformation of probabilities (\(\pi\)) into a different scale using this formula:

\[ \operatorname{logit}(\pi) = \log \left( \frac{\pi}{1 - \pi} \right) \]

This plot shows the relationship between the two scales. Probabilities range from 0 to 1, while logits typically range from −4 to 4ish, where a logit of 0 is a \(\pi\) of 0.5. There are big changes in probability between logits of −4ish and 4ish, but once you start getting into the 5s and beyond, the probability is all essentially the same.
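The two scales are connected by the logit function and its inverse (plogis() in R). A minimal Python version of the pair makes the mapping explicit:

```python
import math

def logit(p):
    """Probability -> log odds (R's qlogis())."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Log odds -> probability (R's plogis())."""
    return 1 / (1 + math.exp(-x))

print(inv_logit(0))            # a logit of 0 is a probability of 0.5
print(round(inv_logit(4), 3))  # already ~0.982; nearly flat past here
print(round(logit(0.5), 3))    # back the other way: 0.0
```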

tibble(x = seq(-8, 8, by = 0.1)) |> 
  mutate(y = plogis(x)) |> 
  ggplot(aes(x = x, y = y)) +
  geom_line(size = 1, color = clrs[4]) +
  labs(x = "Logit scale", y = "Probability scale") +
  theme_pred()

We can create a formal model for the probability of being a Gentoo following a binomial distribution with a size of 1 (i.e. the distribution contains only 0s and 1s—either the penguin is a Gentoo or it is not), and a probability that is conditional on different values of bill length:
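As a sketch of that generative story (Python, with made-up illustrative coefficients rather than anything fitted): compute logit(π) from bill length, back-transform to a probability, then draw a 0/1 outcome with that probability.

```python
import math
import random

random.seed(1234)

# Made-up illustrative coefficients, NOT the fitted model's estimates
beta0, beta1 = -20.0, 0.45

def prob_gentoo(bill_length_mm):
    logit_pi = beta0 + beta1 * bill_length_mm  # linear predictor, logit scale
    return 1 / (1 + math.exp(-logit_pi))       # back-transform to pi

def simulate_is_gentoo(bill_length_mm):
    """One draw from Binomial(1, pi): Gentoo (1) or not (0)."""
    return 1 if random.random() < prob_gentoo(bill_length_mm) else 0

draws = [simulate_is_gentoo(44) for _ in range(10_000)]
share_gentoo = sum(draws) / len(draws)
print(round(prob_gentoo(44), 3))  # pi for a 44 mm bill
print(round(share_gentoo, 3))     # share of 1s lands right on pi
```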

model_logit <- brm(
  bf(is_gentoo ~ bill_length_mm),
  family = bernoulli(link = "logit"),
  data = penguins
)
## Start sampling

We could look at these coefficients and interpret their marginal effects, but here we’re more interested in the distribution of the outcome, not the coefficients (see here or here or here for examples of how to interpret logistic regression coefficients).

Let’s again extract these different posterior distributions with the three main brms functions: posterior_linpred(), posterior_epred(), and posterior_predict(). We’ll look at the posterior distribution when bill_length_mm is its average value, or 43.993:

# Make a little dataset of just the average bill length
penguins_avg_bill <- penguins |> 
  summarize(bill_length_mm = mean(bill_length_mm))

# Extract different types of posteriors
logit_linpred <- model_logit |> 
  linpred_draws(newdata = penguins_avg_bill)

logit_epred <- model_logit |> 
  epred_draws(newdata = penguins_avg_bill)

logit_predicted <- model_logit |> 
  predicted_draws(newdata = penguins_avg_bill)

These each show the posterior distribution of being a Gentoo, but unlike the Gaussian posteriors we looked at earlier, each of these is measured completely differently now!

Code

summary_logit_linpred <- logit_linpred |> 
  ungroup() |> 
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_logit_epred <- logit_epred |> 
  ungroup() |> 
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_logit_predicted <- logit_predicted |> 
  ungroup() |> 
  summarize(across(.prediction, lst(mean), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`, ~Values,
  "<code>posterior_linpred()</code>", "\\(\\operatorname{logit}(\\pi)\\) in the model", "Logits or log odds",
  "<code>posterior_linpred(transform = TRUE)</code> or <code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\pi\\) in the model", "Probabilities",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Binomial}(1, \\pi)\\)", "0s and 1s"
) |> 
  bind_cols(bind_rows(summary_logit_linpred, summary_logit_epred, summary_logit_predicted)) |> 
  kbl(escape = FALSE) |> 
  kable_styling()

posterior_linpred() | logit(π) in the model | Logits or log odds
posterior_linpred(transform = TRUE) or posterior_epred() | E(y) and π in the model | Probabilities
posterior_predict() | Random draws from posterior Binomial(1, π) | 0s and 1s (mean of draws: 0.306)

p1<-ggplot(logit_linpred, aes(x =.linpred))+stat_halfeye(fill =clrs[3])+coord_cartesian(xlim =c(-1.5, -0.2))+labs(x ="Logit-transformed probability of being a Gentoo", y =NULL, title ="**Linear predictor** <span style='font-size: 14px;'>logit(*π*) in the model</span>", subtitle ="posterior_linpred(..., tibble(bill_length_mm = 44))")+theme_pred_dist()p2<-ggplot(logit_epred, aes(x =.epred))+stat_halfeye(fill =clrs[2])+scale_x_continuous(labels =label_percent())+coord_cartesian(xlim =c(0.2, 0.45))+labs(x ="Probability of being a Gentoo", y =NULL, title ="**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *π* in the model</span>", subtitle ="posterior_epred(..., tibble(bill_length_mm = 44))")+theme_pred_dist()p3<-logit_predicted|>count(is_gentoo =.prediction)|>mutate(prop =n/sum(n), prop_nice =label_percent(accuracy =0.1)(prop))|>ggplot(aes(x =factor(is_gentoo), y =n))+geom_col(fill =clrs[1])+geom_text(aes(label =prop_nice), nudge_y =-300, color ="white", size =3)+scale_x_discrete(labels =c("Not Gentoo (0)", "Gentoo (1)"))+scale_y_continuous(labels =label_comma())+labs(x ="Prediction of being a Gentoo", y =NULL, title ="**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Binomial(1, *π*)</span>", subtitle ="posterior_predict(..., tibble(bill_length_mm = 44))")+theme_pred_range()+theme(panel.grid.major.x =element_blank())(p1/plot_spacer()/p2/plot_spacer()/p3)+plot_layout(heights =c(0.3, 0.05, 0.3, 0.05, 0.3))

Unlike the Gaussian/normal regression from earlier, the results from posterior_epred() and posterior_linpred() are not identical here. They still both correspond to the π part of the model, but on different scales. posterior_epred() provides results on the probability scale, un-logiting and back-transforming the results from posterior_linpred() (which provides results on the logit scale).

Again, technically, posterior_epred() isn’t just the back-transformed linear predictor (if you want that, you can use posterior_linpred(..., transform = TRUE)). More formally, posterior_epred() returns the expected values of the posterior, or E[y], or the average of the posterior’s averages. But as with Gaussian regression, for mathy reasons this average-of-averages happens to be the same as the back-transformed π, so E[y] = π.
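To see that equivalence concretely, here’s a quick sketch (assuming the model_logit and penguins_avg_bill objects from above): for a Bernoulli model, the epred draws should match the inverse-logited linpred draws, draw for draw.

```r
# plogis() is R's inverse logit: exp(x) / (1 + exp(x))
linpred_raw <- posterior_linpred(model_logit, newdata = penguins_avg_bill)
epred_raw <- posterior_epred(model_logit, newdata = penguins_avg_bill)

# For bernoulli-family models these should be numerically identical
all.equal(plogis(linpred_raw), epred_raw, check.attributes = FALSE)
```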

The results from posterior_predict() are draws from a random binomial distribution using the estimated π, and they consist of only 0s and 1s (not Gentoo and Gentoo).

Showing these posterior predictions across a range of bill lengths also helps with the intuition here and illustrates the different scales and values that these posterior functions return:

posterior_linpred() returns logit-scale values of π

posterior_epred() returns the value of π on the probability scale (technically it’s returning E[y], but in practice those are identical here)

posterior_predict() returns 0s and 1s, plotted here as points at bill lengths of 35, 45, and 55 mm

pred_logit_gentoo<-tibble(bill_length_mm =c(35, 45, 55))|>add_predicted_draws(model_logit, ndraws =500)pred_logit_gentoo_summary<-pred_logit_gentoo|>group_by(bill_length_mm)|>summarize(prop =mean(.prediction), prop_nice =paste0(label_percent(accuracy =0.1)(prop), "\nGentoos"))p1<-penguins|>data_grid(bill_length_mm =seq_range(bill_length_mm, n =100))|>add_linpred_draws(model_logit, ndraws =100)|>ggplot(aes(x =bill_length_mm))+stat_lineribbon(aes(y =.linpred), .width =0.95, alpha =0.5, color =clrs[3], fill =clrs[3])+coord_cartesian(xlim =c(30, 60))+labs(x ="Bill length (mm)", y ="Logit-transformed\nprobability of being a Gentoo", title ="**Linear predictor posterior** <span style='font-size: 14px;'>logit(*π*) in the model</span>", subtitle ="posterior_linpred()")+theme_pred_range()p2<-penguins|>data_grid(bill_length_mm =seq_range(bill_length_mm, n =100))|>add_epred_draws(model_logit, ndraws =100)|>ggplot(aes(x =bill_length_mm))+geom_dots(data =penguins, aes(y =as.numeric(is_gentoo), x =bill_length_mm, side =ifelse(is_gentoo, "bottom", "top")), pch =19, color ="grey20", scale =0.2)+stat_lineribbon(aes(y =.epred), .width =0.95, alpha =0.5, color =clrs[2], fill =clrs[2])+scale_y_continuous(labels =label_percent())+coord_cartesian(xlim =c(30, 60))+labs(x ="Bill length (mm)", y ="Probability of\nbeing a Gentoo", title ="**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *π* in the model</span>", subtitle ="posterior_epred()")+theme_pred_range()p3<-ggplot(pred_logit_gentoo, aes(x =factor(bill_length_mm), y =.prediction))+geom_point(position =position_jitter(width =0.2, height =0.1, seed =1234), size =0.75, alpha =0.3, color =clrs[1])+geom_text(data =pred_logit_gentoo_summary, aes(y =0.5, label =prop_nice), size =3)+scale_y_continuous(breaks =c(0, 1), labels =c("Not\nGentoo", "Gentoo"))+labs(x ="Bill length (mm)", y ="Prediction of\nbeing a Gentoo", title ="**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior 
Binomial(1, *π*)</span>", subtitle ="posterior_predict()")+theme_pred_range()+theme(panel.grid.major.x =element_blank(), panel.grid.major.y =element_blank(), axis.text.y =element_text(angle =90, hjust =0.5))(p1/plot_spacer()/p2/plot_spacer()/p3)+plot_layout(heights =c(0.3, 0.05, 0.3, 0.05, 0.3))

There are a lot more moving parts here than with Gaussian regression, with different types of posteriors measured on three different scales! This diagram summarizes everything:

Distributional models with link transformations

Regression models often focus solely on the location parameter of the model (e.g., µ in N(µ, σ); π in Binomial(n, π)). However, it is also possible to specify separate predictors for the scale or shape parameters of models (e.g., σ in N(µ, σ), φ in Beta(µ, φ)). In the world of brms, these are called distributional models.

More complex models can use a collection of distributional parameters. Zero-inflated beta models estimate a mean µ, precision φ, and a zero-inflation parameter zi, while hurdle lognormal models estimate a mean µ, scale σ, and a hurdle parameter hu. Even plain old Gaussian models become distributional models when a set of predictors is specified for σ (e.g. brm(bf(y ~ x1 + x2, sigma ~ x2 + x3))).

When working with extra distributional parameters, the various Bayesian posterior prediction functions return values on different scales for each different component of the model, making life even more complex! Estimates and distributional parameters (what brms calls dpar in its functions) from these models can be used in their transformed scales or can be back-transformed into their original scale.

Beta regression example

To show how different link functions and distributional parameters work with posteriors from distributional models, we’ll use beta regression with a single explanatory variable. The penguin data we’ve been using doesn’t have any variables that are proportions or otherwise constrained between 0 and 1, so we’ll make one up. Here we’re interested in the ratio of penguin bill depth (equivalent to the height of the bill; see this illustration) to bill length and whether flipper length influences that ratio. I know nothing about penguins (or birds, for that matter), so I don’t know if biologists even care about the depth/length ratio in bills, but it makes a nice proportion so we’ll go with it.

Here’s what the relationship looks like—as flipper length increases, the bill ratio decreases. Longer-flippered penguins have shallower, longer bills; shorter-flippered penguins have bills that are taller in proportion to their length. Or something like that.

ggplot(penguins, aes(x = flipper_length_mm, y = bill_ratio)) +
  geom_point(size = 1, alpha = 0.7) +
  geom_smooth(method = "lm", color = clrs[5], se = FALSE) +
  labs(x = "Flipper length (mm)", y = "Ratio of bill depth / bill length") +
  theme_pred()

We want to model that green line, and in this case it appears nice and straight and could probably be modeled with regular Gaussian regression, but we also want to make sure any predictions are constrained between 0 and 1 since we’re working with a proportion. Beta regression is perfect for this. Once again, I won’t go into detail about how beta models work—I have a whole detailed guide to it here.

With beta regression, we need to model two parameters of the beta distribution—the mean µ and the precision φ. Ordinarily beta distributions are actually defined by two other parameters, called either shape 1 and shape 2 or a and b. The two systems of parameters are closely related and you can switch between them with a little algebra—see this guide for an example of how.
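The algebra is short enough to sketch here (the variable names and values are just for illustration):

```r
# Switching between the (mu, phi) and (shape1, shape2) parameterizations of
# the beta distribution (shape1/shape2 are what rbeta()/dbeta() expect):
#   shape1 = mu * phi           mu  = shape1 / (shape1 + shape2)
#   shape2 = (1 - mu) * phi     phi = shape1 + shape2
mu <- 0.4
phi <- 100
shape1 <- mu * phi        # 40
shape2 <- (1 - mu) * phi  # 60

# Draws using shape1/shape2 should have a mean close to mu
mean(rbeta(100000, shape1, shape2))
```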

We can create a formal model for the distribution of the ratio of bill depth to bill length with a beta distribution with a mean µ and precision φ, each of which is conditional on different values of flipper length. The models for µ and φ don’t have to use the same explanatory variables—I’m just doing that here for the sake of simplicity:

\[
\begin{aligned}
\text{Bill ratio}_i &\sim \operatorname{Beta}(\mu_i, \phi_i) \\
\operatorname{logit}(\mu_i) &= \beta_0 + \beta_1\, \text{Flipper length}_i \\
\log(\phi_i) &= \gamma_0 + \gamma_1\, \text{Flipper length}_i
\end{aligned}
\]

Or more generally,

\[
\begin{aligned}
y_i &\sim \operatorname{Beta}(\mu_i, \phi_i) \\
\operatorname{logit}(\mu_i) &= \mathbf{X}_i \boldsymbol{\beta} \\
\log(\phi_i) &= \mathbf{X}_i \boldsymbol{\gamma}
\end{aligned}
\]

Let’s fit the model! But first, we’ll actually set more specific priors this time instead of relying on the defaults. Since µ is modeled on the logit scale, its coefficients are unlikely to ever be huge numbers (i.e. anything beyond ±4; recall the probability scale/logit scale plot earlier). The default brms priors for coefficients in beta regression models are flat and uniform, resulting in some potentially huge and implausible priors that lead to really bad model fit (and really slow sampling!). So we’ll help Stan a little here and explicitly tell it that the coefficients will be small (normal(0, 1)) and that φ must be positive (exponential(1) with a lower bound of 0).

model_beta <- brm(
  bf(bill_ratio ~ flipper_length_mm,
     phi ~ flipper_length_mm),
  family = Beta(),
  init = "0",
  data = penguins,
  prior = c(
    prior(normal(0, 1), class = "b"),
    prior(exponential(1), class = "b", dpar = "phi", lb = 0)
  )
)
## Start sampling

Again, we don’t care about the coefficients or marginal effects here—see this guide for more about how to work with those. Let’s instead extract these different posterior distributions of bill ratios with the three main brms functions: posterior_linpred(), posterior_epred(), and posterior_predict(). And once again, we’ll use a single value of flipper length (the average, 200.967 mm) to explore these distributions.

# Make a little dataset of just the average flipper length
penguins_avg_flipper <- penguins |>
  summarize(flipper_length_mm = mean(flipper_length_mm))

# Extract different types of posteriors
beta_linpred <- model_beta |> linpred_draws(newdata = penguins_avg_flipper)
beta_linpred_phi <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, dpar = "phi")
beta_linpred_trans <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, transform = TRUE)
beta_linpred_phi_trans <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, dpar = "phi", transform = TRUE)
beta_epred <- model_beta |> epred_draws(newdata = penguins_avg_flipper)
beta_predicted <- model_beta |> predicted_draws(newdata = penguins_avg_flipper)

Notice the addition of two new posteriors here: linpred_draws(..., dpar = "phi") and linpred_draws(..., dpar = "phi", transform = TRUE). These give us the posterior distributions of the precision (φ) distributional parameter, measured on different scales.

Importantly, for weird historical reasons, it is possible to use posterior_epred(..., dpar = "phi") to get the unlogged φ parameter. However, conceptually this is wrong. An epred is the expected value, or average, of the posterior predictive distribution, or E[y]. It is not the expected value of the φ part of the model. brms (or tidybayes) happily spits out the unlogged posterior distribution of φ when you use posterior_epred(..., dpar = "phi"), but it’s technically not an epred despite its name. To keep the terminology consistent, it’s best to use posterior_linpred() when working with distributional parameters, using either transform = FALSE or transform = TRUE for the logged or the unlogged scale.

Code

summary_beta_linpred <- beta_linpred |>
  ungroup() |>
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_linpred_phi <- beta_linpred_phi |>
  ungroup() |>
  summarize(across(phi, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_linpred_phi_trans <- beta_linpred_phi_trans |>
  ungroup() |>
  summarize(across(phi, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_epred <- beta_epred |>
  ungroup() |>
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_predicted <- beta_predicted |>
  ungroup() |>
  summarize(across(.prediction, lst(mean, sd, median), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`, ~Values,
  "<code>posterior_linpred()</code>", "\\(\\operatorname{logit}(\\mu)\\) in the model", "Logits or log odds",
  "<code>posterior_linpred(transform = TRUE)</code> or <code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\mu\\) in the model", "Probabilities",
  '<code>posterior_linpred(dpar = "phi")</code>', "\\(\\log(\\phi)\\) in the model", "Logged precision values",
  '<code>posterior_linpred(dpar = "phi", transform = TRUE)</code>', "\\(\\phi\\) in the model", "Unlogged precision values",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Beta}(\\mu, \\phi)\\)", "Values between 0–1"
) |>
  bind_cols(bind_rows(
    summary_beta_linpred, summary_beta_epred, summary_beta_linpred_phi,
    summary_beta_linpred_phi_trans, summary_beta_predicted
  )) |>
  kbl(escape = FALSE) |>
  kable_styling()

posterior_linpred() | logit(µ) in the model | Logits or log odds
posterior_linpred(transform = TRUE) or posterior_epred() | E(y) and µ in the model | Probabilities
posterior_linpred(dpar = "phi") | log(φ) in the model | Logged precision values
posterior_linpred(dpar = "phi", transform = TRUE) | φ in the model | Unlogged precision values
posterior_predict() | Random draws from posterior Beta(µ, φ) | Values between 0–1 (mean: 0.397, sd: 0.048, median: 0.397)

Neat! We have a bunch of different pieces here, all measured differently. Let’s look at all these different pieces simultaneously:

p1<-ggplot(beta_linpred, aes(x =.linpred))+stat_halfeye(fill =clrs[3])+labs(x ="Logit-scale ratio of bill depth / bill length", y =NULL, title ="**Linear predictor** <span style='font-size: 14px;'>logit(*µ*) in the model</span>", subtitle ="posterior_linpred(\n ..., tibble(flipper_length_mm = 201))\n")+theme_pred_dist()p1a<-ggplot(beta_linpred_phi, aes(x =phi))+stat_halfeye(fill =colorspace::lighten(clrs[3], 0.3))+labs(x ="Log-scale precision parameter", y =NULL, title ="**Precision parameter** <span style='font-size: 14px;'>log(*φ*) in the model</span>", subtitle ='posterior_linpred(\n ..., tibble(flipper_length_mm = 201),\n dpar = "phi")')+theme_pred_dist()p2<-ggplot(beta_epred, aes(x =.epred))+stat_halfeye(fill =clrs[2])+labs(x ="Ratio of bill depth / bill length", y =NULL, title ="**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] or *µ* in the model</span>", subtitle ="posterior_epred(\n ..., tibble(flipper_length_mm = 201)) # or \nposterior_linpred(..., transform = TRUE)")+theme_pred_dist()p2a<-ggplot(beta_linpred_phi_trans, aes(x =phi))+stat_halfeye(fill =colorspace::lighten(clrs[2], 0.4))+labs(x ="Precision parameter", y =NULL, title ="**Precision parameter** <span style='font-size: 14px;'>*φ* in the model</span>", subtitle ='posterior_linpred(\n ..., tibble(flipper_length_mm = 201),\n dpar = "phi", transform = TRUE)\n')+theme_pred_dist()p3<-ggplot(beta_predicted, aes(x =.prediction))+stat_halfeye(fill =clrs[1])+coord_cartesian(xlim =c(0.2, 0.6))+labs(x ="Ratio of bill depth / bill length", y =NULL, title ="**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Beta(*µ*, *φ*)</span>", subtitle ="posterior_predict()")+theme_pred_dist()layout<-"ABCCDEFFGG"p1+p1a+plot_spacer()+p2+p2a+plot_spacer()+p3+plot_layout(design =layout, heights =c(0.3, 0.05, 0.3, 0.05, 0.3))

As with logistic regression, the results from posterior_epred() and posterior_linpred() are not identical. They still both correspond to the µ part of the model, but on different scales. posterior_epred() provides results on the probability or proportion scale, un-logiting and back-transforming the logit-scale results from posterior_linpred().

And once again, posterior_epred() isn’t technically the back-transformed linear predictor (if you want that, you can use posterior_linpred(..., transform = TRUE)). Instead it shows the expected values of the posterior, or E[y], or the average of the posterior’s averages. But just like Gaussian regression and logistic regression, this average-of-averages still happens to be the same as the back-transformed µ, so E[y] = µ.

We can extract the φ parameter by including the dpar = "phi" argument (or just dpar = TRUE, which returns all possible distributional parameters—helpful in cases with lots of them, like zero-one-inflated beta regression). posterior_linpred(..., dpar = "phi", transform = TRUE) provides φ on the original precision scale (however that’s measured), while posterior_linpred(..., dpar = "phi") returns a log-transformed version.

And finally, the results from posterior_predict() are draws from a random beta distribution using the estimated µ and φ, and they consist of values ranging between 0 and 1.

Showing the posterior predictions for these different parameters across a range of flipper lengths will help with the intuition and illustrate the different scales, values, and parameters that these posterior functions return:

p1<-penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_linpred_draws(model_beta, ndraws =100)|>ggplot(aes(x =flipper_length_mm))+geom_point(data =penguins, aes(y =qlogis(bill_ratio)), size =1, alpha =0.7)+stat_lineribbon(aes(y =.linpred), .width =0.95, alpha =0.5, color =clrs[3], fill =clrs[3])+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Logit-scale ratio of\nbill depth / bill length", title ="**Linear predictor posterior** <span style='font-size: 14px;'>logit(*µ*) in the model</span>", subtitle ="posterior_linpred()")+theme_pred_range()p1a<-penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_linpred_draws(model_beta, ndraws =100, dpar ="phi")|>ggplot(aes(x =flipper_length_mm))+stat_lineribbon(aes(y =phi), .width =0.95, alpha =0.5, color =colorspace::lighten(clrs[3], 0.3), fill =colorspace::lighten(clrs[3], 0.3))+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Log-scale\nprecision parameter", title ="**Precision parameter** <span style='font-size: 14px;'>log(*φ*) in the model</span>", subtitle ='posterior_linpred(dpar = "phi")')+theme_pred_range()p2<-penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_epred_draws(model_beta, ndraws =100)|>ggplot(aes(x =flipper_length_mm))+geom_point(data =penguins, aes(y =bill_ratio), size =1, alpha =0.7)+stat_lineribbon(aes(y =.epred), .width =0.95, alpha =0.5, color =clrs[2], fill =clrs[2])+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Ratio of\nbill depth / bill length", title ="**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] or *µ* in the model</span>", subtitle ='posterior_epred()\nposterior_linpred(transform = TRUE)')+theme_pred_range()p2a<-penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_epred_draws(model_beta, ndraws =100, dpar ="phi")|>ggplot(aes(x =flipper_length_mm))+stat_lineribbon(aes(y =phi), .width 
=0.95, alpha =0.5, color =colorspace::lighten(clrs[2], 0.4), fill =colorspace::lighten(clrs[2], 0.4))+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Precision parameter", title ="**Precision parameter** <span style='font-size: 14px;'>*φ* in the model</span>", subtitle ='posterior_linpred(dpar = "phi",\n transform = TRUE)')+theme_pred_range()p3<-penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_predicted_draws(model_beta, ndraws =500)|>ggplot(aes(x =flipper_length_mm))+geom_point(data =penguins, aes(y =bill_ratio), size =1, alpha =0.7)+stat_lineribbon(aes(y =.prediction), .width =0.95, alpha =0.5, color =clrs[1], fill =clrs[1])+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Ratio of\nbill depth / bill length", title ="**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Beta(*µ*, *φ*)</span>", subtitle ="posterior_predict()")+theme_pred_range()layout<-"ABCCDEFFGG"p1+p1a+plot_spacer()+p2+p2a+plot_spacer()+p3+plot_layout(design =layout, heights =c(0.3, 0.05, 0.3, 0.05, 0.3))

So many moving parts in these distributional models! This diagram summarizes all these different posteriors, scales, and distributional parameters:

Bonus: Playing with posterior beta parameters

Before finishing with beta regression, we can play around with some of these posterior parameters to better understand what this kind of distributional model is actually doing. First, we can plot the posterior distribution using the means of the posterior µ and φ parameters instead of using the results from posterior_predict(), creating a pseudo-analytical posterior distribution. We’ll use the dprop() function from the extraDistr package instead of dbeta(), since dprop() uses µ and φ instead of shape 1 and shape 2.

It’s not the greatest model at all—the actual distribution of bill ratios is bimodal (probably because of species-specific differences), but using the posterior values for µ and φ creates a distribution that picks up the average ratio.

In practice we typically don’t actually want to use these two parameters like this—we can use the results from posterior_predict() instead—but it’s cool that we can produce the same distribution with these parameters. That’s the magic of these distributional models!

mu <- summary_beta_epred$mean
phi <- summary_beta_linpred_phi_trans$mean

ggplot(penguins, aes(x = bill_ratio)) +
  geom_density(aes(fill = "Actual data"), color = NA) +
  stat_function(
    aes(fill = glue::glue("Beta(µ = {round(mu, 3)}, φ = {round(phi, 2)})")),
    geom = "area",
    fun = ~extraDistr::dprop(., mean = mu, size = phi),
    alpha = 0.7
  ) +
  scale_fill_manual(values = c(clrs[5], clrs[1]), name = NULL) +
  xlim(c(0.2, 0.65)) +
  labs(
    x = "Ratio of bill depth / bill length", y = NULL,
    title = "**Analytical posterior predictions** <span style='font-size: 14px;'>Average posterior *µ* and *φ* from the model</span>"
  ) +
  theme_pred_dist() +
  theme(
    legend.position = c(0, 0.9),
    legend.justification = "left",
    legend.key.size = unit(0.75, "lines")
  )

For even more fun, because we modeled the φ parameter as conditional on flipper length, it changes depending on different flipper lengths. This means that the actual posterior beta distribution is shaped differently across a whole range of lengths. Here’s what that looks like, with analytical distributions plotted at 180, 200, and 220 mm. As the precision increases, the distributions become narrower and more precise (which is also reflected in the size of the posterior_predict()-based credible intervals around the points).

muphi_to_shapes<-function(mu, phi){shape1<-mu*phishape2<-(1-mu)*phireturn(lst(shape1 =shape1, shape2 =shape2))}beta_posteriors<-tibble(flipper_length_mm =c(180, 200, 220))|>add_linpred_draws(model_beta, ndraws =500, dpar =TRUE, transform =TRUE)|>group_by(flipper_length_mm)|>summarize(across(c(mu, phi), ~mean(.)))|>ungroup()|>mutate(shapes =map2(mu, phi, ~as_tibble(muphi_to_shapes(.x, .y))))|>unnest(shapes)|>mutate(nice_label =glue::glue("Beta(µ = {round(mu, 3)}, φ = {round(phi, 2)})"))# Here are the parameters we'll use# We need to convert the mu and phi values to shape1 and shape2 so that we can# use dist_beta() to plot the halfeye distributions correctlybeta_posteriors## # A tibble: 3 × 6## flipper_length_mm mu phi shape1 shape2 nice_label ## <dbl> <dbl> <dbl> <dbl> <dbl> <glue> ## 1 180 0.485 58.1 28.2 29.9 Beta(µ = 0.485, φ = 58.1)## 2 200 0.400 104. 41.7 62.6 Beta(µ = 0.4, φ = 104.26)## 3 220 0.320 190. 61.0 129. Beta(µ = 0.32, φ = 190.3)penguins|>data_grid(flipper_length_mm =seq_range(flipper_length_mm, n =100))|>add_predicted_draws(model_beta, ndraws =500)|>ggplot(aes(x =flipper_length_mm))+geom_point(data =penguins, aes(y =bill_ratio), size =1, alpha =0.7)+stat_halfeye(data =beta_posteriors, aes(ydist =dist_beta(shape1, shape2), y =NULL), side ="bottom", fill =clrs[1], alpha =0.75)+stat_lineribbon(aes(y =.prediction), .width =0.95, alpha =0.1, color =clrs[1], fill =clrs[1])+geom_text(data =beta_posteriors, aes(x =flipper_length_mm, y =0.9, label =nice_label), hjust =0.5)+coord_cartesian(xlim =c(170, 230))+labs(x ="Flipper length (mm)", y ="Ratio of\nbill depth / bill length", title ="**Analytical posterior predictions** <span style='font-size: 14px;'>Average posterior *µ* and *φ* from the model</span>")+theme_pred_range()## Warning: Unknown or uninitialised column: `linewidth`.

When posterior_epred() isn’t just the back-transformed linear predictor

In all the examples in this guide, the results from posterior_epred() have been identical to the back-transformed results from posterior_linpred() (or posterior_linpred(..., transform = TRUE) if there are link functions). With logistic regression, posterior_epred() returned the probability-scale values of π; with beta regression, posterior_epred() returned the proportion/probability-scale values of µ. This is the case for many model families in Stan and brms—for mathy reasons that go beyond my skills, the average of averages is the same as the back-transformed linear predictor for lots of distributions.

This isn’t always the case though! In some families, like lognormal models, posterior_epred() and posterior_linpred(..., transform = TRUE) give different estimates. For lognormal models, E[y] isn’t just one of the distribution’s parameters—it’s this:

\[
\operatorname{E}(y) = e^{\mu + \sigma^2 / 2}
\]
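A quick simulation sketch (just the math, no brms model needed) shows why the naive back-transform exp(µ) is wrong for lognormal outcomes:

```r
mu <- 1
sigma <- 0.8

# The mean of a lognormal distribution is exp(mu + sigma^2 / 2), not exp(mu)
draws <- rlnorm(1e6, meanlog = mu, sdlog = sigma)
mean(draws)             # close to exp(1 + 0.8^2 / 2), about 3.74
exp(mu)                 # about 2.72; the naive back-transformed linear predictor
exp(mu + sigma^2 / 2)   # what posterior_epred() would target for this family
```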

I won’t show any examples of that here—this guide is already too long—but Matthew Kay has an example here that shows the differences between expected posterior values and back-transformed linear posterior values.

To see which kinds of families use fancier epreds, look at the source for brms::posterior_epred() here. Most of the families just use the back-transformed mu (prep$dpars$mu in the code), but some have special values, like lognormal’s with(prep$dpars, exp(mu + sigma^2 / 2)).

tl;dr: Diagrams and cheat sheets

Keeping track of which kinds of posterior predictions you’re working with, on which scales, and for which parameters, can be tricky, especially with more complex models with lots of moving parts. To make life easier, here are all the summary diagrams in one place:

Tags: r, tidyverse, ggplot, regression, bayes, brms, stan
https://www.andrewheiss.com/blog/2022/09/26/guide-visualizing-types-posteriors/index.html
Mon, 26 Sep 2022 04:00:00 GMT

Quick and easy ways to deal with long labels in ggplot2
Andrew Heiss
https://www.andrewheiss.com/blog/2022/06/23/long-labels-ggplot/index.html
In one of the assignments for my data visualization class, I have students visualize the number of essential construction projects that were allowed to continue during New York City’s initial COVID shelter-in-place order in March and April 2020. It’s a good dataset to practice visualizing amounts and proportions and to practice with dplyr’s group_by() and summarize() and shows some interesting trends.

The data includes a column for CATEGORY, showing the type of construction project that was allowed. It poses an interesting (and common!) visualization challenge: some of the category names are really long, and if you plot CATEGORY on the x-axis, the labels overlap and become unreadable, like this:

library(tidyverse)  # dplyr, ggplot2, and friends
library(scales)     # Functions to format things nicely

# Load pandemic construction data
essential_raw <- read_csv("https://datavizs22.classes.andrewheiss.com/projects/04-exercise/data/EssentialConstruction.csv")

essential_by_category <- essential_raw %>%
  # Calculate the total number of projects within each category
  group_by(CATEGORY) %>%
  summarize(total = n()) %>%
  # Sort by total
  arrange(desc(total)) %>%
  # Make the category column ordered
  mutate(CATEGORY = fct_inorder(CATEGORY))

Ew. The middle categories here get all blended together into an unreadable mess.

Fortunately there are a bunch of different ways to fix this, each with their own advantages and disadvantages!

Option A: Make the plot wider

One quick and easy way to fix this is to change the dimensions of the plot so that there’s more space along the x-axis. If you’re using R Markdown or Quarto, you can modify the chunk options and specify fig.width:
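For example, the chunk might look something like this (the exact dimensions are up to you; fig.width is the knitr-style chunk option, shown here with Quarto-style `#|` comments):

```r
#| fig-width: 10
#| fig-height: 4
# (In classic R Markdown you'd put fig.width=10, fig.height=4 in the
# ```{r} chunk header instead)
ggplot(essential_by_category, aes(x = CATEGORY, y = total)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "Total projects")
```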

Now the font is bigger, but the labels overlap again! We could make the figure wider again, but then we’d need to increase the font size again, and now we’re in an endless loop.

Verdict: 2/10, easy to do, but more of a quick band-aid-style solution; not super recommended.

Option B: Swap the x- and y-axes

Another quick and easy solution is to switch the x- and y-axes. If we put the categories on the y-axis, each label will be on its own line so the labels can’t overlap with each other anymore:

That works really well! However, it forces you to work with horizontal bars. If that doesn’t fit with your overall design (e.g., if you really want vertical bars), this won’t work. Additionally, if you have any really long labels, it can substantially shrink the plot area, like this:

# Make one of the labels super long for fun
essential_by_category %>%
  mutate(CATEGORY = recode(CATEGORY,
    "Schools" = "Preschools, elementary schools, middle schools, high schools, and other schools")) %>%
  ggplot(aes(y = fct_rev(CATEGORY), x = total)) +
  geom_col() +
  scale_x_continuous(labels = comma) +
  labs(y = NULL, x = "Total projects")

Verdict: 6/10, easy to do and works well if you’re happy with horizontal bars; can break if labels are too long (though long y-axis labels are fixable with the other techniques in this post too).

Option C: Recode some longer labels

Instead of messing with the width of the plot, we can mess with the category names themselves. We can use recode() from dplyr to recode some of the longer category names or add line breaks (\n) to them:

essential_by_category_shorter <- essential_by_category %>%
  mutate(CATEGORY = recode(CATEGORY,
    "Affordable Housing" = "Aff. Hous.",
    "Hospital / Health Care" = "Hosp./Health",
    "Public Housing" = "Pub. Hous.",
    "Homeless Shelter" = "Homeless\nShelter"))

ggplot(essential_by_category_shorter, aes(x = CATEGORY, y = total)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "Total projects")

That works great! However, it reduces readability (does “Aff. Hous.” mean affordable housing? affluent housing? affable housing?). It also requires more manual work and a lot of extra typing. If a new longer category gets added in a later iteration of the data, this code won’t automatically shorten it.

Verdict: 6/10, we have more control over the labels, but too much abbreviation reduces readability, and it’s not automatic.

Option D: Rotate the labels

Since we want to avoid manually recoding categories, we can do some visual tricks to make the labels readable without changing any of the label text. First we can rotate the labels a little. Here we rotate the labels 30°, but we could also use 45°, 90°, or whatever we want. If we add hjust = 0.5 (horizontal justification), the rotated labels will be centered in the columns, and vjust (vertical justification) will center the labels vertically.
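The rotation itself is just a theme() tweak; a minimal sketch (the exact angle and justification values are a matter of taste):

```r
ggplot(essential_by_category, aes(x = CATEGORY, y = total)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "Total projects") +
  # Rotate the x-axis labels 30 degrees and center them under the columns
  theme(axis.text.x = element_text(angle = 30, hjust = 0.5, vjust = 0.5))
```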

Everything fits great now, but I’m not a big fan of angled text. I’m also not happy with all the empty vertical space between the axis and the shorter labels like “Schools” and “Utility”. It would look a lot nicer to have all these labels right-aligned to the axis, but there’s no easy way to do that.

Verdict: 5.5/10, no manual work needed, but angled text is harder to read and there’s lots of extra uneven whitespace.

Option E: Dodge the labels

Second, instead of rotating, as of ggplot2 v3.3.0 we can automatically dodge the labels and make them offset across multiple rows with the guide_axis(n.dodge = N) function in scale_x_*():
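A minimal sketch of that approach (n.dodge = 2 offsets every other label into a second row):

```r
ggplot(essential_by_category, aes(x = CATEGORY, y = total)) +
  geom_col() +
  # Offset every other x-axis label into a second row
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "Total projects")
```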

That’s pretty neat. Again, this is all automatic and we don’t have to manually adjust any labels. The text is all horizontal so it’s more readable. But I’m not a huge fan of the gaps above the second-row labels. Maybe it would look better if the corresponding axis ticks were a little longer, idk.

Verdict: 7/10, no manual work needed, labels easy to read, but there’s extra whitespace that can sometimes feel unbalanced.

Option F: Automatically add line breaks

The easiest and quickest and nicest way to fix these long labels, though, is to use the label_wrap() function from the scales package. This will automatically add line breaks after X characters in labels with lots of text—you just have to tell it how many characters to use. The function is smart enough to try to break after word boundaries—that is, if you tell it to break after 5 characters, it won’t split something like “Approved” into “Appro” and “ved”; it’ll break after the end of the word.
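Because label_wrap() returns a labelling function, you can also try it out directly on a character vector before wiring it into a plot; a quick sketch:

```r
library(scales)

# label_wrap(10) returns a function that inserts line breaks after roughly
# 10 characters, breaking only at word boundaries
wrap10 <- label_wrap(10)
wrap10(c("Approved Work", "Affordable Housing"))
```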

Look at how the x-axis labels automatically break across lines! That’s so neat!

Verdict: 11/10, no manual work needed, labels easy to read, everything’s perfect. This is the way.

Bonus: For things that aren’t axis labels, like titles and subtitles, you can use str_wrap() from stringr to break long text at X characters (specified with width):

ggplot(essential_by_category, aes(x = CATEGORY, y = total)) +
  geom_col() +
  scale_x_discrete(labels = label_wrap(10)) +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "Total projects",
       title = str_wrap("Here's a really long title that will go off the edge of the figure unless it gets broken somewhere", width = 50),
       subtitle = str_wrap("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.", width = 70))

Summary

Here’s a quick comparison of all these different approaches:

Session Info

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Monterey 12.6
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-11-29
## pandoc 2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
## quarto 1.3.26 @ /usr/local/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ! package * version date (UTC) lib source
## P dplyr * 1.0.10 2022-09-01 [?] CRAN (R 4.2.0)
## P forcats * 0.5.2 2022-08-19 [?] CRAN (R 4.2.0)
## P ggplot2 * 3.4.0 2022-11-04 [?] CRAN (R 4.2.0)
## P patchwork * 1.1.2 2022-08-19 [?] CRAN (R 4.2.0)
## P purrr * 0.3.5 2022-10-06 [?] CRAN (R 4.2.0)
## P readr * 2.1.3 2022-10-01 [?] CRAN (R 4.2.0)
## P scales * 1.2.1 2022-08-20 [?] CRAN (R 4.2.0)
## P sessioninfo * 1.2.2 2021-12-06 [?] CRAN (R 4.2.0)
## P stringr * 1.4.1 2022-08-20 [?] CRAN (R 4.2.0)
## P tibble * 3.1.8 2022-07-22 [?] CRAN (R 4.2.0)
## P tidyr * 1.2.1 2022-09-08 [?] CRAN (R 4.2.0)
## P tidyverse * 1.3.2 2022-07-18 [?] CRAN (R 4.2.0)
##
## [1] /Users/andrew/Sites/ath-quarto/renv/library/R-4.2/aarch64-apple-darwin20
## [2] /Users/andrew/Sites/ath-quarto/renv/sandbox/R-4.2/aarch64-apple-darwin20/84ba8b13
##
## P ── Loaded and on-disk path mismatch.
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

(tags: r, tidyverse, ggplot, data visualization)
https://www.andrewheiss.com/blog/2022/06/23/long-labels-ggplot/index.html
Thu, 23 Jun 2022 04:00:00 GMT

Marginalia: A guide to figuring out what the heck marginal effects, marginal slopes, average marginal effects, marginal effects at the mean, and all these other marginal things are
Andrew Heiss
https://www.andrewheiss.com/blog/2022/05/20/marginalia/index.html

Diagrams!

You can download PDF, SVG, and PNG versions of the marginal effects diagrams in this guide, as well as the original Adobe Illustrator file, here:

According to David, researchers typically view their work like this:

People work towards a final published product, which is the most valuable output of the whole process. The intermediate steps like the code, data, preliminary results, and so on, are less valuable and often hidden from the public. People only see the final published thing.

David argues that we should instead see our work like this:

In this paradigm, anything on your computer and only accessible by you isn’t that valuable. Anything you make accessible to the public online—including all the intermediate stuff like code, data, and preliminary results, in addition to the final product—is incredibly valuable. The world can benefit from neat code tricks you stumble on while making graphs; the world can benefit from new data sources you find or your way of processing data; the world can benefit from a toy example of a new method you read about in some paper, even if the actual code you write to play around with the method never makes it into any published paper. It’s all useful to the broader community of researchers.

Public work also builds community norms—if more people share their behind-the-scenes work, it encourages others to do the same and engage with it and improve it (see this super detailed and helpful comment with corrections to my previous post, for example!).

Public work is also valuable for another more selfish reason. Building an online presence with a wide readership is hard, and my little blog post contributions aren’t famous or anything—they’re just sitting out here in a tiny corner of the internet. But these guides have been indispensable for me. They’ve allowed me to work through and understand tricky statistical and programming concepts, and then have allowed me to come back to them months later and remember how they work. This whole blog is primarily a resource for future me.

So here’s yet another blog post that is hopefully potentially useful for the general public, but that is definitely useful for future me.

Part of the reason for this wrongness is that there are so many quasi-synonyms for the idea of "marginal effects," and people seem to be pretty loosey goosey about what exactly they're referring to. There are statistical effects, marginal effects, marginal means, marginal slopes, conditional effects, conditional marginal effects, marginal effects at the mean, and many other similarly named ideas. There are also regression coefficients and estimates, which have marginal effects vibes, but may or may not actually be marginal effects depending on the complexity of the model.

The question of what the heck “marginal effects” are has plagued me for a while. In October 2021 I publicly announced that I would finally buckle down and figure out their definitions and nuances:

And then I didn’t.

So here I am, 7 months later, publicly figuring out the differences between regression coefficients, regression predictions, marginaleffects, emmeans, marginal slopes, average marginal effects, marginal effects at the mean, and all these other “marginal” things that researchers and data scientists use.

This guide is highly didactic and slowly builds up the concept of marginal effects as slopes and partial derivatives. The tl;dr section at the end has a useful summary of everything here, with a table showing all the different approaches to marginal effects with corresponding marginaleffects and emmeans code, as well as some diagrams outlining the two packages’ different approaches to averaging. Hopefully it’s useful—it is for me!

Let’s get started by looking at some lines and slopes (after loading a bunch of packages and creating some useful little functions).

# Load packages
# ---------------
library(tidyverse)        # dplyr, ggplot2, and friends
library(broom)            # Convert models to data frames
library(marginaleffects)  # Marginal effects stuff
library(emmeans)          # Marginal effects stuff

# Visualization-related packages
library(ggtext)           # Add markdown/HTML support to text in plots
library(glue)             # Python-esque string interpolation
library(scales)           # Functions to format numbers nicely
library(gganimate)        # Make animated plots
library(patchwork)        # Combine ggplots
library(ggrepel)          # Make labels that don't overlap
library(MetBrewer)        # Artsy color palettes

# Data-related packages
library(palmerpenguins)   # Penguin data
library(WDI)              # Get data from the World Bank's API
library(countrycode)      # Map country codes to different systems
library(vdemdata)         # Use data from the Varieties of Democracy (V-Dem) project
# Install vdemdata from GitHub, not CRAN
# devtools::install_github("vdeminstitute/vdemdata")

# Helpful functions
# -------------------
# Format numbers in pretty ways
nice_number <- label_number(style_negative = "minus", accuracy = 0.01)
nice_p <- label_pvalue(prefix = c("p < ", "p = ", "p > "))

# Point-slope formula: (y - y1) = m(x - x1)
find_intercept <- function(x1, y1, slope) {
  intercept <- slope * (-x1) + y1
  return(intercept)
}

# Visualization settings
# ------------------------
# Custom ggplot theme to make pretty plots
# Get IBM Plex Sans Condensed at https://fonts.google.com/specimen/IBM+Plex+Sans+Condensed
theme_mfx <- function() {
  theme_minimal(base_family = "IBM Plex Sans Condensed") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          plot.title = element_text(face = "bold"),
          axis.title = element_text(face = "bold"),
          strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          legend.title = element_text(face = "bold"))
}

# Make labels use IBM Plex Sans by default
update_geom_defaults("label", list(family = "IBM Plex Sans Condensed"))
update_geom_defaults(ggtext::GeomRichText, list(family = "IBM Plex Sans Condensed"))
update_geom_defaults("label_repel", list(family = "IBM Plex Sans Condensed"))

# Use the Johnson color palette
clrs <- met.brewer("Johnson")

What does “marginal” even mean in the first place?

Put as simply as possible, in the world of statistics, "marginal" means "additional," or what happens to an outcome variable when an explanatory variable changes a little.

To find out precisely how much things change, we need to use calculus.

Oh no.

Super quick crash course in differential calculus (it’s not scary, I promise!)

I haven’t taken a formal calculus class since my senior year of high school in 2002. I enjoyed it a ton and got the highest score on the AP Calculus BC test, which gave me enough college credits to not need it as an undergraduate, given that I majored in Middle East Studies, Arabic, and Italian. I figured I’d never need to think about calculus ever again. lol.

In my first PhD-level stats class in 2012, the professor cancelled class for the first month and assigned us all to go relearn calculus with Khan Academy, since I wasn’t alone in my unlearning of calculus. Even after that crash course refresher, I don’t really ever use it in my own research. When I do, I only use it to think about derivatives and slopes, since those are central to statistics.

Calculus can be boiled down to two forms: (1) differential calculus is all about finding rates of changes by calculating derivatives, or slopes, while (2) integral calculus is all about finding total amounts, or areas, by adding infinitesimally small things together. According to the fundamental theorem of calculus, these two types are actually the inverse of each other—you can find the total area under a curve based on its slope, for instance. Super neat stuff. If you want a cool accessible refresher / history of all this, check out Steven Strogatz’s Infinite Powers: How Calculus Reveals the Secrets of the Universe—it’s great.

In the world of statistics and marginal effects all we care about are slopes, which are solely a differential calculus idea.

Let’s pretend we have a line that shows the relationship between x and y and that’s defined with an equation using the form y = mx + b, where m is the slope and b is the y-intercept. We can plot it with ggplot using the helpful geom_function() function:

# y = 2x - 1
a_line <- function(x) (2 * x) - 1

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_function(fun = a_line, size = 1, color = clrs[2]) +
  scale_x_continuous(breaks = -2:5, limits = c(-1, 3)) +
  scale_y_continuous(breaks = -3:9) +
  annotate(geom = "segment", x = 1, y = 1.3, xend = 1, yend = 3, color = clrs[4], size = 0.5) +
  annotate(geom = "segment", x = 1, y = 3, xend = 1.8, yend = 3, color = clrs[4], size = 0.5) +
  annotate(geom = "richtext", x = 1.4, y = 3.1, label = "Slope: **2**", vjust = 0) +
  labs(x = "x", y = "y") +
  coord_equal() +
  theme_mfx()

The line crosses the y-axis at -1, and its slope, or its m, is 2 (rise over run, or 2/1), meaning that we go up two units and to the right one unit.

Importantly, the slope shows the relationship between x and y. If x increases by 1 unit, y increases by 2: when x is 1, y is 1; when x is 2, y is 3, and so on. We can call this the marginal effect, or the change in y that results from one additional unit of x.

We can think about this slope using calculus language too. In differential calculus, slopes are called derivatives and they represent the change in y that results from changes in x, or dy/dx. The d here refers to an infinitesimal change in the values of x and y, rather than a one-unit change like we think of when looking at the slope as rise over run. Even more technically, the d indicates that we're working with the total derivative, since there's only one variable (x) to consider. If we had more variables (like z), we would need to find the partial derivative for x, holding z constant, and we'd write the derivative with a ∂ symbol instead: ∂y/∂x. More on that in a bit.

By plotting this line, we can figure out dy/dx visually—the slope is 2. But we can figure it out mathematically too. Differential calculus is full of fancy tricks and rules of thumb for figuring out derivatives, like the power rule, the chain rule, and so on. The easiest one for me to remember is the power rule, which says you can find the slope of a term like ax^n by multiplying the coefficient by the exponent and then decreasing the exponent by 1, giving n·a·x^(n-1). All constants (terms without x) disappear.

(My secret is that I only know the power rule and so I avoid calculus at all costs and either use R or use Wolfram Alpha—go to Wolfram Alpha, type in derivative y = 2x - 1 and you’ll see some magic.)
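Speaking of using R: base R can do simple symbolic differentiation with D(), which is a handy way to check these derivatives (a quick illustration, not from the original post):

```r
# Derivative of the line y = 2x - 1: the slope is just the constant 2
line_deriv <- D(expression(2 * x - 1), "x")

# Derivative of the parabola y = -0.5x^2 + 5x + 5 used in the next section,
# which simplifies algebraically to -x + 5
parab_deriv <- D(expression(-0.5 * x^2 + 5 * x + 5), "x")

# Evaluate the parabola's derivative at a few values of x
sapply(c(0, 3, 8), function(x) eval(parab_deriv))  # 5, 2, -3
```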

We thus know that the derivative of y = 2x - 1 is dy/dx = 2. At every point on this line, the slope is 2—it never changes.

slope_annotations <- tibble(x = c(-0.25, 1.2, 2.4)) |>
  mutate(y = a_line(x)) |>
  mutate(nice_y = y + 1) |>
  mutate(nice_label = glue("x: {x}; y: {y}<br>",
                           "Slope (dy/dx): **{2}**"))

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_function(fun = a_line, size = 1, color = clrs[2]) +
  geom_point(data = slope_annotations, aes(x = x, y = y)) +
  geom_richtext(data = slope_annotations,
                aes(x = x, y = y, label = nice_label),
                nudge_y = 0.5) +
  scale_x_continuous(breaks = -2:5, limits = c(-1, 3)) +
  scale_y_continuous(breaks = -3:9) +
  labs(x = "x", y = "y") +
  coord_equal() +
  theme_mfx()

The power rule seems super basic for equations with non-exponentiated xs, but it's really helpful with more complex equations, like this parabola, y = -0.5x^2 + 5x + 5:

# y = -0.5x^2 + 5x + 5
a_parabola <- function(x) (-0.5 * x^2) + (5 * x) + 5

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_function(fun = a_parabola, size = 1, color = clrs[2]) +
  xlim(-5, 15) +
  labs(x = "x", y = "y") +
  coord_cartesian(ylim = c(-5, 20)) +
  theme_mfx()

What's interesting here is that there's no longer a single slope for the whole function. The steepness of the slope across a range of xs depends on whatever x currently is. The curve is steeper at really low and really high values of x, and it is shallower around x = 5 (and completely flat when x is 5).

If we apply the power rule to the parabola formula we can find the exact slope:

dy/dx = -x + 5

When x is 0, the slope is 5 (-0 + 5); when x is 8, the slope is -3 (-8 + 5), and so on. We can visualize this if we draw lines tangent to a few different points on the curve. The slope of each of these tangent lines represents the instantaneous slope of the parabola at each value of x.

# dy/dx = -x + 5
parabola_slope <- function(x) (-x) + 5

slope_annotations <- tibble(x = c(0, 3, 8)) |>
  mutate(y = a_parabola(x),
         slope = parabola_slope(x),
         intercept = find_intercept(x, y, slope),
         nice_slope = glue("Slope (dy/dx)<br><span style='font-size:12pt;color:{clrs[4]}'>**{slope}**</span>"))

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_function(fun = a_parabola, size = 1, color = clrs[2]) +
  geom_abline(data = slope_annotations,
              aes(slope = slope, intercept = intercept),
              size = 0.5, color = clrs[4], linetype = "21") +
  geom_point(data = slope_annotations, aes(x = x, y = y), size = 3, color = clrs[4]) +
  geom_richtext(data = slope_annotations,
                aes(x = x, y = y, label = nice_slope),
                nudge_y = 2) +
  xlim(-5, 15) +
  labs(x = "x", y = "y") +
  coord_cartesian(ylim = c(-5, 20)) +
  theme_mfx()

And here’s an animation of what the slope looks like across a whole range of s. Neat!

In the calculus world, the term “marginal” isn’t used all that often. Instead they talk about derivatives. But in the end, all these marginal/derivative things are just slopes.

Before looking at how this applies to the world of statistics, let’s look at a quick example from economics, since economists also use the word “marginal” to refer to slopes. My first exposure to the word “marginal” meaning “changes in things” wasn’t actually in the world of statistics, but in economics. I took my first microeconomics class as a first-year MPA student in 2010 (and hated it; ironically I teach it now 🤷).

One common question in microeconomics relates to how people maximize their happiness, or utility, under budget constraints (see here for an R-based example). Economists imagine that people have utility functions in their heads that take inputs and convert them to utility (or happiness points). For instance, let's pretend that the happiness/utility (u) you get from the number of cookies you eat (x) is defined like this:

u = -0.5x^2 + 5x

Here’s what that looks like:

# u = -0.5x^2 + 5x
u_cookies <- function(x) (-0.5 * x^2) + (5 * x)

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_function(fun = u_cookies, size = 1, color = clrs[2]) +
  scale_x_continuous(breaks = seq(0, 12, 2), limits = c(0, 12)) +
  labs(x = "Cookies", y = "Utility (happiness points)") +
  theme_mfx()

This parabola represents your total utility from cookies. Eat 1 cookie, get 4.5 happiness points; eat 3 cookies, get 10.5 points; eat 6, get 12 points; and so on.

The marginal utility, on the other hand, tells you how much more happiness you'd get from eating one more cookie. If you're currently eating 1, how many more happiness points would you get by moving to 2? If you're eating 7, what would happen to your happiness if you moved to 8? We can figure this out by looking at the slope of the parabola, which will show us the instantaneous rate of change, or marginal utility, for any number of cookies.

# du/dx = -x + 5
mu_cookies <- function(x) -x + 5

ggplot() +
  geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
  geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
  geom_vline(xintercept = 5, size = 0.5, linetype = "21", color = clrs[3]) +
  geom_function(fun = mu_cookies, size = 1, color = clrs[5]) +
  scale_x_continuous(breaks = seq(0, 12, 2), limits = c(0, 12)) +
  labs(x = "Cookies", y = "Marginal utility (additional happiness points)") +
  theme_mfx()

If you’re currently eating 1 cookie and you grab another one, you’ll gain 4 extra or marginal happiness points. If you’re eating 6 and you grab another one, you’ll actually lose some happiness—the marginal utility at 6 is -1. If you’re an economist who wants to maximize your happiness, you should eat the number of cookies where the extra happiness you’d get is 0, or where marginal utility is 0:

Eat 5 cookies, maximize your happiness. Eat any more and you’ll start getting disutility (like a stomachache). This is apparent in the marginal utility plot too. All the values of marginal utility to the left of 5 are positive; all the values to the right of 5 are negative. Economists call this decreasing marginal utility.
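We can double-check the cookie math numerically (a quick sketch, not from the original post) by maximizing the utility function with base R's optimize() and confirming that marginal utility hits zero at the optimum:

```r
# Total utility and marginal utility, as defined above
u_cookies <- function(x) (-0.5 * x^2) + (5 * x)
mu_cookies <- function(x) -x + 5

# Numerically maximize total utility over 0-12 cookies
best <- optimize(u_cookies, interval = c(0, 12), maximum = TRUE)

best$maximum             # ~5 cookies maximizes happiness
u_cookies(best$maximum)  # ~12.5 happiness points at the peak
mu_cookies(best$maximum) # ~0 extra happiness from one more cookie
```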

This relationship between total utility and marginal utility is even more apparent if we look at both simultaneously (for fun I included the second derivative (d^2u/dx^2), or the slope of the first derivative, in the marginal utility panel):

Marginal utility, marginal revenue, marginal costs, and all those other marginal things are great for economists, but how does this “marginal” concept relate to statistics? Is it the same?

Statistics is all about lines, and lines have slopes, or derivatives. These slopes represent the marginal changes in an outcome. As you move an independent/explanatory variable x, what happens to the dependent/outcome variable y?

Regression, sliders, switches, and mixing boards

Before getting into the mechanics of statistical marginal effects, it’s helpful to review what exactly regression coefficients are doing in statistical models, especially when dealing with both continuous and categorical explanatory variables.

When I teach statistics to my students, my favorite analogy for regression is to think of sliders and switches. Sliders represent continuous variables: as you move them up and down, something gradual happens to the resulting light. Switches represent categorical variables: as you turn them on and off, there are larger overall changes to the resulting light.

Let’s look at some super tiny quick models to illustrate this, using data from palmerpenguins:
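The code chunk itself isn't shown in this excerpt; a minimal sketch of the two tiny models the next paragraphs interpret might look like this (the model names are my own, and the exact fitted coefficients will differ slightly from the rounded values quoted in the text):

```r
library(palmerpenguins)

# Slider: one continuous explanatory variable
model_slider <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Switch: one categorical explanatory variable (Adelie is the base case)
model_switch <- lm(body_mass_g ~ species, data = penguins)

coef(model_slider)
coef(model_switch)
```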

Disregard the intercept for now and just look at the coefficients for flipper_length_mm and species. Flipper length is a continuous variable, so it's a slider—as flipper length increases by 1 mm, penguin body mass increases by 50 grams. Slide it up more and you'll see a bigger increase: if flipper length increases by 10 mm, body mass should increase by 500 grams. Slide it down for fun too! If flipper length decreases by 1 mm, body mass decreases by 50 grams. Imagine it like a sliding light switch.

Species, on the other hand, is a switch. There are three possible values here: Adelie, Chinstrap, and Gentoo. The base case in the results here is Adelie since it comes first alphabetically. The coefficients for speciesChinstrap and speciesGentoo aren't sliders—you can't talk about one-unit increases in Gentoo-ness or Chinstrap-ness. Instead, the values show what happens in relation to the average weight of Adelie penguins if you flip the Chinstrap or Gentoo switch. Chinstrap penguins are 29 grams heavier than Adelie penguins on average, while the chonky Gentoo penguins are 1.4 kg heavier than Adelie penguins. With these categorical coefficients, we're flipping a switch on and off: Adelie vs. Chinstrap and Adelie vs. Gentoo.

This slider and switch analogy holds when thinking about multiple regression too, though we need to think of lots of sliders and switches, like in an audio mixer board:

With a mixer board, we can move many different sliders up and down and use different combinations of switches, all of which ultimately influence the audio output.

Let’s make a more complex mixer-board-esque regression model with multiple continuous (slider) and categorical (switch) explanatory variables:
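Again, the original chunk isn't reproduced in this excerpt; a sketch of a mixer-board model with the variables the next paragraph interprets (the model_mixer name matches the one referenced later in the post) might look like this:

```r
library(palmerpenguins)

# Several sliders (continuous) and switches (categorical) at once
model_mixer <- lm(
  body_mass_g ~ flipper_length_mm + bill_depth_mm + species + sex,
  data = penguins
)

coef(model_mixer)
```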

Interpreting these coefficients is a little different now, since we’re working with multiple moving parts. In regular stats class, you’ve probably learned to say something like “Holding all other variables constant, a 1 mm increase in flipper length is associated with a 17.5 gram increase in body mass, on average” (slider) or “Holding all other variables constant, Chinstrap penguins are 79 grams lighter than Adelie penguins, on average” (switch).

This idea of "holding everything constant" though can be tricky to wrap your head around. Imagining this model like a mixer board can help, though. Pretend that you set the bill depth slider to some value (0, the average, whatever), you flip the Chinstrap and Gentoo switches off, you flip the male switch off, and then you slide only the flipper length slider up and down. You'd be looking at the marginal effect of flipper length for female Adelie penguins with an average (or 0, or whatever) bill depth. Stop moving the flipper length slider and start moving the bill depth slider and you'll see the marginal effect of bill depth for female Adelie penguins. Flip on the male switch and you'll see the marginal effect of bill depth for male Adelie penguins. Flip on the Gentoo switch and you'll see the marginal effect of bill depth for male Gentoo penguins. And so on.

In calculus, if you have a model like model_slider with just one continuous variable, the slope or derivative of that variable is the total derivative, or dy/dx. If you have a model like model_mixer with lots of other variables, the slope or derivative of any of the individual explanatory variables is the partial derivative, or ∂y/∂x, where all other variables are held constant.

What are marginal effects?

Oops. When talking about these penguin regression results up there ↑ I used the term “marginal effect,” but we haven’t officially defined it in the statistics world yet. It’s tricky to do that, though, because there are so many synonyms and near synonyms for the idea of a statistical effect, like marginal effect, marginal mean, marginal slope, conditional effect, conditional marginal effect, and so on.

Formally defined, a marginal effect is a partial derivative from a regression equation. It’s the instantaneous slope of one of the explanatory variables in a model, with all the other variables held constant. If we continue with the mixing board analogy, it represents what would happen to the resulting audio levels if we set all sliders and switches to some stationary level and we moved just one slider up a tiny amount.

However, in practice, people use the term “marginal effect” to mean a lot more than just a partial derivative. For instance, in a randomized controlled trial, the difference in group means between the treatment and control groups is often called a marginal effect (and sometimes called a conditional effect, or even a conditional marginal effect). The term is also often used to talk about other group differences, like differences in penguin weights across species.

In my mind, all these quasi-synonymous terms represent the same idea of a statistical effect, or what would happen to an outcome if one of the explanatory variables (be it continuous, categorical, or whatever) were different. The more precise terms like marginal effect, conditional effect, marginal mean, and so on, are variations on this theme. This is similar to how a square is a rectangle, but a rectangle is not a square—they're all super similar, but with minor subtle differences depending on the type of explanatory variable we're working with:

Marginal effect: the statistical effect for continuous explanatory variables; the partial derivative of a variable in a regression model; the effect of a single slider

Conditional effect or group contrast: the statistical effect for categorical explanatory variables; the difference in means when a condition is on vs. when it is off; the effect of a single switch

Slopes and marginal effects

Let’s look at true marginal effects, or the partial derivatives of continuous variables in a model (or sliders, in our slider/switch analogy). For the rest of this post, we’ll move away from penguins and instead look at some cross-national data about the relationship between public sector corruption, the legal requirement to disclose donations to political campaigns, and respect for human rights, since that’s all more related to what I do in my own research (I know nothing about penguins). We’ll explore two different political science/policy questions:

What is the relationship between a country’s respect for civil liberties and its level of public sector corruption? Do countries that respect individual human rights tend to have less corruption too?

Does a country’s level of public sector corruption influence whether it has laws that require campaign finance disclosure? How does corruption influence a country’s choice to be electorally transparent?

We’ll use data from the World Bank and from the Varieties of Democracy project and just look at one year of data (2020) so we don’t have to worry about panel data. There’s a great R package for accessing V-Dem data without needing to download it manually from their website, but it’s not on CRAN—it has to be installed from GitHub.

V-Dem and the World Bank have hundreds of different variables, but we only need a few, and we’ll make a few adjustments to the ones we do need. Here’s what we’ll do:

Main continuous outcome and continuous explanatory variable: Public sector corruption index (v2x_pubcorr in V-Dem). This is a 0–1 scale that measures…

To what extent do public sector employees grant favors in exchange for bribes, kickbacks, or other material inducements, and how often do they steal, embezzle, or misappropriate public funds or other state resources for personal or family use?

Higher values represent worse corruption.

Main binary outcome: Disclosure of campaign donations (v2eldonate_ord in V-Dem). This is an ordinal variable with these possible values:

0: No. There are no disclosure requirements.

1: Not really. There are some, possibly partial, disclosure requirements in place but they are not observed or enforced most of the time.

2: Ambiguous. There are disclosure requirements in place, but it is unclear to what extent they are observed or enforced.

3: Mostly. The disclosure requirements may not be fully comprehensive (some donations not covered), but most existing arrangements are observed and enforced.

4: Yes. There are comprehensive requirements and they are observed and enforced almost all the time.

For the sake of simplicity, we’ll collapse this into a binary variable. Countries have disclosure laws if they score a 3 or a 4; they don’t if they score a 0, 1, or 2.

Other continuous explanatory variables:

Electoral democracy index, or polyarchy (v2x_polyarchy in V-Dem): a continuous variable measured from 0–1 with higher values representing greater achievement of democratic ideals

Civil liberties index (v2x_civlib in V-Dem): a continuous variable measured from 0–1 with higher values representing better respect for human rights and civil liberties

Log GDP per capita (NY.GDP.PCAP.KD at the World Bank): GDP per capita in constant 2015 USD

Region: V-Dem provides multiple regional variables with varying specificity (19 different regions, 10 different regions, and 6 different regions). We’ll use the 6-region version (e_regionpol_6C) for simplicity here:

1: Eastern Europe and Central Asia (including Mongolia)

2: Latin America and the Caribbean

3: The Middle East and North Africa (including Israel and Turkey, excluding Cyprus)

4: Sub-Saharan Africa

5: Western Europe and North America (including Cyprus, Australia and New Zealand)

6: Asia and Pacific (excluding Australia and New Zealand)

# Get data from the World Bank's API
wdi_raw <- WDI(country = "all",
               indicator = c(population = "SP.POP.TOTL",
                             gdp_percapita = "NY.GDP.PCAP.KD"),
               start = 2000, end = 2020, extra = TRUE)

# Clean up the World Bank data
wdi_2020 <- wdi_raw |>
  filter(region != "Aggregates") |>
  filter(year == 2020) |>
  mutate(log_gdp_percapita = log(gdp_percapita)) |>
  select(-region, -status, -year, -country, -lastupdated, -lending)

# Get data from V-Dem and clean it up
vdem_2020 <- vdem %>%
  select(country_name, country_text_id, year,
         region = e_regionpol_6C,
         disclose_donations_ord = v2eldonate_ord,
         public_sector_corruption = v2x_pubcorr,
         polyarchy = v2x_polyarchy,
         civil_liberties = v2x_civlib) %>%
  filter(year == 2020) %>%
  mutate(disclose_donations = disclose_donations_ord >= 3,
         disclose_donations = ifelse(is.na(disclose_donations), FALSE, disclose_donations)) %>%
  # Scale these up so it's easier to talk about 1-unit changes
  mutate(across(c(public_sector_corruption, polyarchy, civil_liberties), ~ . * 100)) |>
  mutate(region = factor(region,
                         labels = c("Eastern Europe and Central Asia",
                                    "Latin America and the Caribbean",
                                    "Middle East and North Africa",
                                    "Sub-Saharan Africa",
                                    "Western Europe and North America",
                                    "Asia and Pacific")))

# Combine World Bank and V-Dem data into a single dataset
corruption <- vdem_2020 |>
  left_join(wdi_2020, by = c("country_text_id" = "iso3c")) |>
  drop_na(gdp_percapita)

glimpse(corruption)
## Rows: 168
## Columns: 17
## $ country_name             <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana",…
## $ country_text_id          <chr> "MEX", "SUR", "SWE", "CHE", "GHA", "ZAF", "JPN", "MMR",…
## $ year                     <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ region                   <fct> Latin America and the Caribbean, Latin America and the …
## $ disclose_donations_ord   <dbl> 3, 1, 2, 0, 2, 1, 3, 2, 3, 2, 2, 0, 3, 3, 4, 3, 4, 2, 1…
## $ public_sector_corruption <dbl> 48.8, 24.8, 1.3, 1.4, 65.2, 57.1, 3.7, 36.8, 70.6, 71.2…
## $ polyarchy                <dbl> 64.7, 76.1, 90.8, 89.4, 72.0, 70.3, 83.2, 43.6, 26.2, 4…
## $ civil_liberties          <dbl> 71.2, 87.7, 96.9, 94.8, 90.4, 82.2, 92.8, 56.9, 43.0, 8…
## $ disclose_donations       <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, T…
## $ iso2c                    <chr> "MX", "SR", "SE", "CH", "GH", "ZA", "JP", "MM", "RU", "…
## $ population               <dbl> 1.29e+08, 5.87e+05, 1.04e+07, 8.64e+06, 3.11e+07, 5.93e…
## $ gdp_percapita            <dbl> 8923, 7530, 51542, 85685, 2021, 5659, 34556, 1587, 9711…
## $ capital                  <chr> "Mexico City", "Paramaribo", "Stockholm", "Bern", "Accr…
## $ longitude                <chr> "-99.1276", "-55.1679", "18.0645", "7.44821", "-0.20795…
## $ latitude                 <chr> "19.427", "5.8232", "59.3327", "46.948", "5.57045", "-2…
## $ income                   <chr> "Upper middle income", "Upper middle income", "High inc…
## $ log_gdp_percapita        <dbl> 9.10, 8.93, 10.85, 11.36, 7.61, 8.64, 10.45, 7.37, 9.18…

Let’s start off by looking at the effect of civil liberties on public sector corruption by using a really simple model with one explanatory variable:

plot_corruption <- corruption |>
  mutate(highlight = civil_liberties == min(civil_liberties) |
           civil_liberties == max(civil_liberties))

ggplot(plot_corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
  geom_point(aes(color = highlight)) +
  stat_smooth(method = "lm", formula = y ~ x, size = 1, color = clrs[1]) +
  geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
                   aes(label = country_name), seed = 1234) +
  scale_color_manual(values = c("grey30", clrs[3]), guide = "none") +
  labs(x = "Civil liberties index", y = "Public sector corruption index") +
  theme_mfx()

We have a nice fitted OLS line here with uncertainty around it. What’s the marginal effect of civil liberties on public sector corruption? What kind of calculus and math do we need to do to find it? Not much, happily!

In general, we have a regression formula here that looks a lot like the stuff we were using before, only now the intercept is $\beta_0$ and the slope is $\beta_1$. If we use the power rule to find the first derivative of this equation, we'll see that the slope of the entire line is $\beta_1$:

$$
\widehat{\text{corruption}} = \beta_0 + \beta_1 \text{civil liberties}
$$

$$
\frac{\partial\, \widehat{\text{corruption}}}{\partial\, \text{civil liberties}} = \beta_1
$$

If we add actual coefficients from the model into the formula we can see that the coefficient for civil_liberties (−0.80) is indeed the marginal effect:

$$
\frac{\partial\, \widehat{\text{corruption}}}{\partial\, \text{civil liberties}} = -0.80
$$
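A minimal sketch of fitting that one-predictor model (the name `model_simple` is a placeholder of my own; it assumes the `corruption` data built above):

```r
# Hypothetical sketch: the simple one-explanatory-variable OLS model
# (its civil_liberties estimate is the -0.80 discussed in the text)
model_simple <- lm(public_sector_corruption ~ civil_liberties,
                   data = corruption)
tidy(model_simple)
```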

The coefficient by itself is thus enough to tell us what the effect of moving civil liberties around is—it is the marginal effect of civil liberties on public sector corruption. Slide the civil liberties index up by 1 point and public sector corruption will be 0.80 points lower, on average.

Importantly, this is only the case because we're using simple linear regression without any curvy parts. If your model is completely linear, with no polynomials, logs, or interaction terms, and doesn't use curvy regression families like logistic or beta regression, you can use individual coefficients as marginal effects.

Let’s see what happens when we add curves. We’ll add a polynomial term, including both civil_liberties and civil_liberties^2 so that we can capture the parabolic shape of the relationship between civil liberties and corruption:
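A sketch of that quadratic fit, which later chunks refer to as `model_sq` (again assuming the `corruption` data from above):

```r
# Quadratic model: civil_liberties plus its square
model_sq <- lm(public_sector_corruption ~ civil_liberties + I(civil_liberties^2),
               data = corruption)
tidy(model_sq)
```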

ggplot(plot_corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
  geom_point(aes(color = highlight)) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1, color = clrs[2]) +
  geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
                   aes(label = country_name), seed = 1234) +
  scale_color_manual(values = c("grey30", clrs[3]), guide = "none") +
  labs(x = "Civil liberties index", y = "Public sector corruption index") +
  theme_mfx()

This is most likely not a great model fit in real life, but using the quadratic term here makes a neat curved line, so we’ll go with it for the sake of the example. But don’t, like, make any policy decisions based on this line.

When working with polynomials in regression, the coefficients appear and work a little differently:

$$
\widehat{\text{corruption}} = \beta_0 + \beta_1 \text{civil liberties} + \beta_2 \text{civil liberties}^2
$$

We now have two coefficients for civil liberties: 1.58 for the regular term and −0.02 for the squared term. Importantly, we cannot use just one of these to talk about the marginal effect of changing civil liberties. A one-point increase in the civil liberties index is not associated with a 1.58-point increase or a 0.02-point decrease in corruption. The slope of the fitted line now comprises multiple moving parts: (1) the coefficient for the non-squared term, (2) the coefficient for the squared term, and (3) some specific value of civil liberties, since the slope isn't the same across the whole line. The math shows us why and how.

We have terms for both $x$ and $x^2$ in our model (where $x$ is civil liberties). To find the derivative, we can use the power rule to get rid of the $x$ in the $\beta_1 x$ term (it becomes just $\beta_1$), but the $x$ in the $\beta_2 x^2$ term doesn't disappear (it becomes $2 \beta_2 x$). The slope of the line thus depends on both the $\beta$s and the value of $x$:

$$
\frac{\partial y}{\partial x} = \beta_1 + 2 \beta_2 x
$$

Here's what that looks like with the results of our civil liberties and corruption model:

$$
\frac{\partial\, \widehat{\text{corruption}}}{\partial\, \text{civil liberties}} = 1.58 + (2 \times -0.02 \times \text{civil liberties})
$$

Because the actual slope depends on the value of civil liberties, we need to plug in different values to get the instantaneous slopes at each value. Let’s plug in 25, 55, and 80, for fun:

# Extract the two civil_liberties coefficients
civ_lib1 <- tidy(model_sq) |>
  filter(term == "civil_liberties") |>
  pull(estimate)

civ_lib2 <- tidy(model_sq) |>
  filter(term == "I(civil_liberties^2)") |>
  pull(estimate)

# Make a little function to do the math
civ_lib_slope <- function(x) civ_lib1 + (2 * civ_lib2 * x)

civ_lib_slope(c(25, 55, 80))
## [1]  0.594 -0.587 -1.572

We have three different slopes now: 0.59, −0.59, and −1.57 for civil liberties of 25, 55, and 80, respectively. We can plot these as tangent lines:

tangents <- model_sq |>
  augment(newdata = tibble(civil_liberties = c(25, 55, 80))) |>
  mutate(slope = civ_lib_slope(civil_liberties),
         intercept = find_intercept(civil_liberties, .fitted, slope)) |>
  mutate(nice_label = glue("Civil liberties: {civil_liberties}<br>",
                           "Fitted corruption: {nice_number(.fitted)}<br>",
                           "Slope: **{nice_number(slope)}**"))

ggplot(corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
  geom_point(color = "grey30") +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2),
              size = 1, se = FALSE, color = clrs[4]) +
  geom_abline(data = tangents, aes(slope = slope, intercept = intercept),
              size = 0.5, color = clrs[2], linetype = "21") +
  geom_point(data = tangents, aes(x = civil_liberties, y = .fitted),
             size = 4, shape = 18, color = clrs[2]) +
  geom_richtext(data = tangents,
                aes(x = civil_liberties, y = .fitted, label = nice_label),
                nudge_y = -7) +
  labs(x = "Civil liberties index", y = "Public sector corruption index") +
  theme_mfx()

Doing the calculus by hand here is tedious though, especially once we start working with even more covariates in a model. Plus we don’t have any information about uncertainty, like standard errors and confidence intervals. There are official mathy ways to figure those out by hand, but who even wants to do that. Fortunately there are two different packages that let us find marginal slopes automatically, with important differences in their procedures, which we’ll explore in detail below. But before looking at their differences, let’s first see how they work.

First, we can use the marginaleffects() function from marginaleffects to see the slope (the dydx column here) at various levels of civil liberties. We’ll look at the mechanics of this function in more detail in the next section—for now we’ll just plug in our three values of civil liberties and see what happens. We’ll also set the eps argument: behind the scenes, marginaleffects() doesn’t actually do the by-hand calculus of piecing together first derivatives—instead, it calculates the fitted value of corruption when civil liberties is a value, calculates the fitted value of corruption when civil liberties is that same value plus a tiny bit more, and then subtracts them. The eps value controls that tiny amount. In this case, it’ll calculate the predictions for civil_liberties = 25 and civil_liberties = 25.001 and then find the slope of the tiny tangent line between those two points. It’s a neat little mathy trick to avoid calculus.
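A sketch of that call, assuming `model_sq` from above (this uses the pre-1.0 `marginaleffects()` API that the rest of this post uses):

```r
# Slopes (the dydx column) at three specific values of civil_liberties;
# eps controls the tiny nudge used for the numerical derivative
model_sq |>
  marginaleffects(newdata = datagrid(civil_liberties = c(25, 55, 80)),
                  eps = 0.001)
```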

Second, we can use the emtrends() function from emmeans to also see the slope (the civil_liberties.trend column here) at various levels of civil liberties. The syntax is different (note the delta.var argument instead of eps), but the results are essentially the same:
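A sketch of the equivalent `emtrends()` call:

```r
# Slopes (the civil_liberties.trend column) at the same three values;
# delta.var plays the role that eps plays in marginaleffects()
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties = c(25, 55, 80)),
           delta.var = 0.001)
```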

Both marginaleffects() and emtrends() also helpfully provide uncertainty, with standard errors and confidence intervals, with a lot of super fancy math behind the scenes to make it all work. marginaleffects() provides p-values automatically; if you want p-values from emtrends() you need to wrap it in test():
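A sketch of that `test()` wrapping:

```r
# Wrap emtrends() in test() to get p-values for each slope
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties = c(25, 55, 80)),
           delta.var = 0.001) |>
  test()
```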

Another neat thing about these more automatic functions is that we can use them to create a marginal effects plot, placing the value of the slope on the y-axis rather than the fitted value of public corruption. marginaleffects helpfully provides plot_cme(), which plots the values of dydx across the whole range of civil liberties automatically. Alternatively, if we want full control over the plot, we can use either marginaleffects() or emtrends() to create a data frame that we can plot ourselves with ggplot:

# Automatic plot from marginaleffects::plot_cme()
mfx_marginaleffects_auto <- plot_cme(model_sq,
                                     effect = "civil_liberties",
                                     condition = "civil_liberties") +
  labs(x = "Civil liberties",
       y = "Marginal effect of civil liberties on public sector corruption",
       subtitle = "Created automatically with marginaleffects::plot_cme()") +
  theme_mfx()

# Piece all the geoms together manually with results from marginaleffects::marginaleffects()
mfx_marginaleffects <- model_sq |>
  marginaleffects(newdata = datagrid(civil_liberties = seq(min(corruption$civil_liberties),
                                                           max(corruption$civil_liberties),
                                                           0.1)),
                  eps = 0.001) |>
  ggplot(aes(x = civil_liberties, y = dydx)) +
  geom_vline(xintercept = 42, color = clrs[3], size = 0.5, linetype = "24") +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, fill = clrs[1]) +
  geom_line(size = 1, color = clrs[1]) +
  labs(x = "Civil liberties",
       y = "Marginal effect of civil liberties on public sector corruption",
       subtitle = "Calculated with marginaleffects()") +
  theme_mfx()

# Piece all the geoms together manually with results from emmeans::emtrends()
mfx_emtrends <- model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties = seq(min(corruption$civil_liberties),
                                           max(corruption$civil_liberties),
                                           0.1)),
           delta.var = 0.001) |>
  as_tibble() |>
  ggplot(aes(x = civil_liberties, y = civil_liberties.trend)) +
  geom_vline(xintercept = 42, color = clrs[3], size = 0.5, linetype = "24") +
  geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.1, fill = clrs[1]) +
  geom_line(size = 1, color = clrs[1]) +
  labs(x = "Civil liberties",
       y = "Marginal effect of civil liberties on public sector corruption",
       subtitle = "Calculated with emtrends()") +
  theme_mfx()

mfx_marginaleffects_auto | mfx_marginaleffects | mfx_emtrends

This kind of plot is useful since it shows precisely how the effect changes across civil liberties. The slope is 0 at around 42, positive before that, and negative after that, which—assuming this is a good model and who even knows if that’s true—implies that countries with low levels of respect for civil liberties will see an increase in corruption as civil liberties increases, while countries with high respect for civil liberties will see a decrease in corruption as they improve their respect for human rights.

marginaleffects’s and emmeans’s philosophies of averaging

Finding marginal effects for lines defined by exact equations with calculus is fairly easy since there's no uncertainty involved. Finding marginal effects for fitted lines from a regression model, on the other hand, is more complicated because uncertainty abounds. The estimated partial slopes all have standard errors and measures of statistical significance attached to them. The slope of civil liberties at 55 is −0.59, but it could be higher and it could be lower. Could it even possibly be zero? Maybe! (But most likely not; the p-value that we saw above is less than 0.001, so there's only a sliver of a chance of seeing a slope like −0.59 in a world where it is actually 0ish.)

We deal with the uncertainty of these marginal effects by taking averages, which is why we talk about “average marginal effects” when interpreting these effects. So far, marginaleffects::marginaleffects() and emmeans::emtrends() have given identical results. But behind the scenes, these packages take two different approaches to calculating these marginal averages. The difference is very subtle, but incredibly important.

Let’s look at how these two packages calculate their marginal effects by default.

Average marginal effects (the default in marginaleffects)

By default, marginaleffects calculates the average marginal effect (AME) for its partial slopes/coefficients. To do this, it follows a specific process of averaging:

It first plugs each row of the original dataset into the model and generates predictions for each row. It then uses fancy math (i.e. adding 0.001) to calculate the instantaneous slope for each row and stores each individual slope in the dydx column here:
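A sketch of that per-row object (the next chunk calls it `mfx_sq`, assuming `model_sq` from earlier; with no newdata argument, the old `marginaleffects()` API uses the original data by default):

```r
# One slope estimate (dydx) per row of the original corruption data
mfx_sq <- marginaleffects(model_sq, eps = 0.001)
head(mfx_sq)
```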

It finally calculates the average of the dydx column. We can do that ourselves:

mfx_sq |>
  group_by(term) |>
  summarize(avg_dydx = mean(dydx))
## # A tibble: 1 × 2
##   term            avg_dydx
##   <chr>              <dbl>
## 1 civil_liberties    -1.17

Or we can feed a marginaleffects object to summary() or tidy(), which will calculate the correct uncertainty statistics, like the standard errors:

summary(mfx_sq)
##             Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 civil_liberties  -1.17     0.0948   -12.3   <2e-16 -1.35  -0.98
##
## Model type:  lm
## Prediction type:  response

Note that the average marginal effect here isn’t the same as what we saw before when we set civil liberties to different values. In this case, the effect is averaged across the whole range of civil liberties—one single grand average mean. It shows that in general, the overall average slope of the fitted line is −1.17.

Don’t worry about the number too much here—we’re just exploring the underlying process of calculating this average marginal effect. In general, as the image shows above, for average marginal effects, we take the full original data, feed it to the model, generate fitted values for each original row, and then collapse the results into a single value.

The main advantage of doing this is that each dydx prediction uses values that exist in the actual data. The first dydx slope estimate is for Mexico in 2020 and is based on Mexico’s actual value of civil_liberties (and any other covariates if we had included any others in the model). It’s thus more reflective of reality.

Marginal effects at the mean (the default in emmeans)

A different approach for this averaging is to calculate the marginal effect at the mean, or MEM. This is what the emmeans package does by default. (The emmeans package actually calculates two average things: “marginal effects at the means” (MEM), or average slopes using emtrends(), and “estimated marginal means” (EMM), or average predictions using emmeans(). It’s named after the second of these, hence the name emmeans).

To do this, we follow a slightly different process of averaging:

First, we calculate the average value of each of the covariates in the model (in this case, just civil_liberties):

Because of rounding (and because the difference is so tiny), it looks like the two rows are identical, but they're not—the second one really is 0.001 more than 69.682.

We then subtract the two and divide by 0.001 to get the final marginal effect at the mean:
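Those three steps can be sketched by hand, using `augment()` from broom as in the earlier tangent-line chunk:

```r
# Marginal effect at the mean, by hand:
# 1. Average value of the covariate
mean_civlib <- mean(corruption$civil_liberties)

# 2. Fitted values at the mean and at the mean plus a tiny bit
preds <- augment(model_sq,
                 newdata = tibble(civil_liberties = c(mean_civlib,
                                                      mean_civlib + 0.001)))

# 3. Subtract the two predictions and divide by 0.001
(preds$.fitted[2] - preds$.fitted[1]) / 0.001
```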

That doesn’t give us any standard errors or uncertainty or anything, so it’s better to use emtrends() or marginaleffects(). emtrends() calculates this MEM automatically:
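A sketch of that automatic MEM calculation (with no `at` values, emmeans builds its reference grid at the covariate's mean by default):

```r
# Slope at the mean of civil_liberties, with standard errors included
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties", delta.var = 0.001)
```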

We can also calculate the MEM with marginaleffects() if we include the newdata = "mean" argument, which will automatically shrink the original data down into average or typical values:

model_sq |>
  marginaleffects(newdata = "mean") |>
  summary()
##             Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 civil_liberties  -1.17     0.0948   -12.3   <2e-16 -1.35  -0.98
##
## Model type:  lm
## Prediction type:  response

The disadvantage of this approach is that no actual country has a civil_liberties score of exactly 69.682. If we had other covariates in the model, no country would have exactly the average of every variable. The marginal effect is thus calculated for a hypothetical country that might not actually exist in real life.

Where this subtle difference really matters

So far, comparing average marginal effects (AME) with marginal effects at the mean (MEM) hasn’t been that useful, since both marginaleffects() and emtrends() provided nearly identical results with our simple model with civil liberties squared. That’s because nothing that strange is going on in the model—there are no additional explanatory variables, no interactions or logs, and we’re using OLS and not anything fancy like logistic regression or beta regression.

Things change once we leave the land of OLS.

Let's make a new model that predicts whether a country has campaign finance disclosure laws based on public sector corruption. Having a disclosure law is a binary outcome, so we'll use logistic regression to constrain the fitted values and predictions to between 0 and 1.

Even without any squared terms, we’re already in non-linear land. We can build a model and explore this relationship:

model_logit <- glm(disclose_donations ~ public_sector_corruption,
                   family = binomial(link = "logit"),
                   data = corruption)

tidy(model_logit)
## # A tibble: 2 × 5
##   term                     estimate std.error statistic  p.value
##   <chr>                       <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                1.98     0.388        5.09 3.51e- 7
## 2 public_sector_corruption  -0.0678   0.00991     -6.84 7.85e-12

The coefficients here are on a different scale and are measured in log odds units (or logits), not probabilities or percentage points. That means we can’t use those coefficients directly. We ca