(with references stored in a `references.bib` BibTeX file):

```markdown
---
title: "Some title"
bibliography: references.bib
---

According to @Lovelace:1842, computers can calculate things. This was important
during World War II [@Turing:1936].
```
After running the document through pandoc, those citation keys turn into formatted in-text references, with a full bibliography at the end.
This is all great and ideal when working with documents that have a single bibliography at the end.
Some documents—like course syllabuses and reading lists—don't have a final bibliography. Instead they have lists of things people should read. However, if you insert citations like normal, you'll get the inline references plus a final bibliography:

```markdown
---
title: "Some course syllabus"
bibliography: references.bib
---

## Course schedule

### Week 1

- [@Lovelace:1842]
- [@Turing:1936]

### Week 2

- [@Keynes:1937]
```
The full citations are all in the document, but not in a very convenient location. Readers have to go to the back of the document to see what they actually need to read (especially if there’s a website or DOI URL they need to click on).
It would be great if the full citation could be included in the lists in the document instead of at the end of the document.
And it’s possible, with just a minor tweak to the Citation Style Language (CSL) style file that you’re using (thanks to adam.smith at StackOverflow for pointing out how).
By default, pandoc uses Chicago author-date for bibliographic references—hence the (Lovelace 1842) style of references. You can download any other CSL file from Zotero's searchable style repository, from the Citation Styles project's searchable list, or clone the full massive GitHub repository of styles to find others, like Chicago notes, APA, MLA, and so on.

The easiest way to get full citations inline is to find a CSL style that uses note-based citations, like Chicago full note, and edit the CSL file to tell it to be an in-text style instead of a note style.
The second line of every CSL file contains a `<style>` XML element with a `class` attribute. In-text styles like APA and Chicago author-date have `class="in-text"`:

```xml
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0" demote-non-dropping-particle="display-and-sort" page-range-format="chicago">
  <info>
    <title>Chicago Manual of Style 17th edition (author-date)</title>
    ...
```
…while note-based styles like Chicago notes have `class="note"`:

```xml
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="note" version="1.0" demote-non-dropping-particle="display-and-sort" page-range-format="chicago">
  <info>
    <title>Chicago Manual of Style 17th edition (full note)</title>
    ...
```
If you download a note-based CSL style and manually change it to be in-text, the footnotes it would insert get placed in the text itself instead of as footnotes. Here I downloaded Chicago full note, edited the second line to say `class="in-text"`, and saved it as `chicago-syllabus.csl`:
```xml
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0" demote-non-dropping-particle="display-and-sort" page-range-format="chicago">
  <info>
    <title>Chicago Manual of Style 17th edition (full note, but in-text)</title>
    ...
```
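If you'd rather not make that edit by hand, the change is easy to script. Here's a minimal R sketch; the input filename is an assumption based on the style's usual name in the Zotero repository:

```r
# Flip a note-based CSL style to in-text with a plain string replacement.
# "chicago-fullnote-bibliography.csl" is an assumed input filename; use
# whatever note-based CSL file you actually downloaded.
csl <- readLines("chicago-fullnote-bibliography.csl")
csl <- sub('class="note"', 'class="in-text"', csl, fixed = TRUE)
writeLines(csl, "chicago-syllabus.csl")
```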
I can then tell pandoc to use that CSL when rendering the document:

```markdown
---
title: "Some course syllabus"
bibliography: references.bib
csl: chicago-syllabus.csl
---

## Course schedule

### Week 1

- [@Lovelace:1842]
- [@Turing:1936]

### Week 2

- [@Keynes:1937]
```
…and the full references are included in the document itself!
This isn't quite perfect, though. There are three glaring problems:

1. We still have a bibliography at the end, since Chicago's notes-bibliography system requires one. That makes sense for regular documents, where footnotes throughout the body are paired with a list of references at the end, but it's not necessary here.
2. The in-text references all hyperlink to their corresponding entries in the final bibliography. We don't need those links, since the linked text *is* the bibliography entry.
3. If you render this in Quarto, you get helpful popups that contain the full reference when you hover over a link. But again, the link already is the full reference, so that extra hover information is redundant.
All three problems are easy to fix with some additional YAML settings that suppress the final bibliography, turn off citation links, and disable Quarto's hovering:

```markdown
---
title: "Some course syllabus"
bibliography: references.bib
csl: chicago-syllabus.csl
suppress-bibliography: true
link-citations: false
citations-hover: false
---

## Course schedule

### Week 1

- [@Lovelace:1842]
- [@Turing:1936]

### Week 2

- [@Keynes:1937]
```
Perfect!
This is all great and super easy if you (like me) are fond of Chicago. What if you want to use APA, though? Or MLA? Or any other style that doesn’t use footnotes?
For APA, you're in luck! There's an APA (curriculum vitae) CSL style that you can use, and you don't need to edit it beforehand—it just works:

```markdown
---
title: "Some course syllabus with APA"
bibliography: references.bib
csl: apa-cv.csl
suppress-bibliography: true
link-citations: false
citations-hover: false
---

## Course schedule

### Week 1

- [@Lovelace:1842]
- [@Turing:1936]

### Week 2

- [@Keynes:1937]
```
For any other style, though, you're (somewhat) out of luck. The simple trick of switching `class="note"` to `class="in-text"` doesn't work if the underlying style is already in-text, like APA or Chicago author-date. You'd have to do some major editing and rearranging of the CSL file to force the bibliography entries to show up as inline citations, which goes beyond my skills.
As a workaround, you can use the {RefManageR} package in R to read the bibliography file and output the bibliography portion of each citation as Markdown. Steve Miller has a helpful guide for this here.
When I started my first master's degree program in 2008, I decided to stop using Word for all my academic writing and instead use plain text Markdown for everything. Markdown itself had been a thing for 4 years, and MultiMarkdown—a pandoc-like extension of Markdown that could handle BibTeX bibliographies—was brand new. I did all my writing for my courses and my thesis in Markdown and converted it all to PDF through LaTeX using MultiMarkdown. I didn't know about pandoc yet, so I only ever converted to PDF, not HTML or Word.
I stored all my bibliographic references in a tiny little `references.bib` BibTeX file that I managed with BibDesk. BibDesk is a wonderful and powerful program with an active developer community, and it does all sorts of neat stuff, like auto-filing PDFs, importing references from DOIs, searching for references on the internet from inside the program, and just providing a nice overall front end for dealing with BibTeX files.
I kept using my MultiMarkdown + LaTeX output system throughout my second master's degree, and my `references.bib` file and PDF database slowly grew. R Markdown hadn't been invented yet and I still hadn't discovered pandoc, so living in a mostly LaTeX-based world was fine.
When I started my PhD in 2012, something revolutionary happened: the {knitr} package was invented. The new R Markdown format let you mix R code with Markdown text and create multiple outputs (HTML, LaTeX, and docx) through pandoc. I abandoned MultiMarkdown and fully converted to pandoc (thanks also in part to Kieran Healy's Plain Person's Guide to Plain Text Social Science). Since 2012, I've written exclusively in pandoc-flavored Markdown and always make sure that I can convert everything to PDF, HTML, and Word (see the "Manuscript" entry in the navigation bar here, for instance, where you can download the preprint version of that paper in a ton of different formats). I recently converted a bunch of my output templates to Quarto too.
During all this time, I didn't really keep up with other reference managers. I used super early Zotero as an undergrad back in 2006–2008, but it didn't fit well with my Markdown-based workflow, so I kind of ignored it. I picked it up again briefly at the beginning of my PhD, but I couldn't get it to play nicely with R Markdown and pandoc, so I kept using trusty old BibDesk. My `references.bib` file got bigger and bigger as I took more and more doctoral classes and did more research, but BibDesk handled the growing library just fine. As of today, I've got 1,400 items in there with nearly 1,000 PDFs, and everything still works great—mostly.
BibDesk got me through my dissertation and all my research projects up until now, so why consider switching to some other system? Over the past few years, as I've done more reading on my iPad and worked on more coauthored projects, I've run into a few pain points in my citation workflow.
I enjoy reading PDFs on my iPad (particularly in the iAnnotate app), but getting PDFs from BibDesk onto the iPad has always required a bizarre dance between `references.bib` and the BibDesk-managed folder of PDFs in Dropbox. I'd often get sick of this convoluted process and just find the PDF on my computer and AirDrop it to my iPad directly, completely circumventing Dropbox. I'd then AirDrop it back to my computer and attach the marked-up PDF to the reference in BibDesk. It's inconvenient, but less inconvenient than bouncing around a bunch of different apps and hoping everything works.
BibTeX works great with LaTeX. That's why it was invented in the first place! The fact that things like pandoc work with it is partially a historical accident: `.bib` files were a convenient and widely used plain text bibliography format, so pandoc and MultiMarkdown used BibTeX for citations.
But citations are often more complicated than BibTeX can handle. Consider the LaTeX package biblatex-chicago: in order to be fully compliant with all the intricacies of the Chicago Manual of Style, it has to expand the BibTeX (technically BibLaTeX) format to include fields like `entrysubtype` for distinguishing between magazine/newspaper articles and journal articles, among dozens of other customizations and tweaks. BibTeX has a limited set of entry types, and anything that's not one of those types gets shoehorned into the `misc` type.
Internally, programs like pandoc that can read BibTeX files convert them into the standard Citation Style Language (CSL) format, which they then use to format references as Chicago, APA, MLA, or whatever. It would be great to store all my citations in a CSL-compliant format in the first place rather than in a LaTeX-only format that has to be converted on the fly for every non-LaTeX output.
Zotero conveniently fixes all these issues:
It has a synchronization service that works across platforms (including iOS). It can work with Dropbox too if you don’t want to be bound by their file size limit or pay for extra storage, though I ended up paying for storage to (1) support open source software and (2) not have to deal with multiple programs. I’ve been doing the BibDesk → iAnnotate → Dropbox → MacBook → AirDrop dance for too many years—I just want Zotero to handle all the syncing for me.
It's super easy to collaborate with Zotero. You can create shared group libraries with different sets of coauthors and not worry about Dropbox synchronization issues or the accidental deletion of `}` characters in the `.bib` file. For one of my reading-intensive classes, I've even created a shared Zotero group library that all the students can join and cite from, which is neat.
It’s also far easier to maintain a master list of references. You can create a Zotero collection for specific projects, and items can live in multiple collections. Editing an item in one collection updates that item in all other collections. Zotero treats collections like iTunes/Apple Music playlists—just like songs can belong to multiple playlists, bibliographic entries can belong to multiple collections.
Zotero follows the CSL standard that pandoc uses. It was the first program to adopt CSL (way back in 2006!). It supports all kinds of entry types and fields, beyond what BibTeX supports.
Migrating my big `references.bib` file to Zotero was a relatively straightforward process, but it required a few minor shenanigans to get everything working right.
Preparing everything for migration meant I had to make a ton of edits to the original references.bib
file, so I made a copy of it first and worked with the copy.
To make Zotero work nicely with a pandoc-centric writing workflow, and to make file management and tag management easier, I installed three extensions: Better BibTeX, ZotFile, and Zotero Tag.
BibDesk allows you to add a couple extra metadata fields to entries for ratings and to mark them as read. I’ve used these fields for years and find them super useful for keeping track of how much I like articles and for remembering which ones I’ve actually finished.
Internally, BibDesk stores this data as extra fields in the raw BibTeX:

```bibtex
@article{the_citekey_for_this_entry,
  author = {Whoever},
  title = {Whatever},
  ...
  rating = {4},
  read = {1}}
```
These fields are preserved and transferred to Zotero when you import the file, but they show up in the “Extra” field and aren’t easily filterable or sortable there:
I decided to treat these as Zotero tags, which BibDesk calls keywords. I considered writing a script to programmatically convert all the `rating` and `read` fields to `keywords`, but that seemed like too much work—many entries have existing keywords, and parsing the file and concatenating ratings and read status onto the keyword list would be hard.
So instead I sorted all my entries in BibDesk by rating, selected all the 5-star ones and added a `zzzzz` tag, selected all the 4-star ones and added a `zzzz` tag, and so on (so that 1-star entries got a `z` tag). I then sorted the entries by read status and assigned `xxx` to all the ones I've read. These tag names were just temporary—in Zotero I changed them to emojis (⭐️⭐️⭐️ and ✅)—but because I was worried about transferring complex Unicode characters like emojis across programs, I decided to simplify things by temporarily using plain ASCII characters.
BibDesk can auto-file attached PDFs and manage their location. To keep track of where the files are, it stores each path as a base64-encoded value in a `bdsk-file-N` field in the `.bib` file, like this:
```bibtex
@article{HeissKelley:2017,
  author = {Andrew Heiss and Judith G. Kelley},
  doi = {10.1086/691218},
  journal = {Journal of Politics},
  month = {4},
  number = {2},
  pages = {732--41},
  title = {Between a Rock and a Hard Place: International {NGOs} and the Dual Pressures of Donors and Host Governments},
  volume = {79},
  year = {2017},
  bdsk-file-1 = {YnBsaXN0MDDSAQIDBFxyZWxhdGl2ZVBhdGhZYWxpYXNEYXRhXxBcUGFwZXJzL0hlaXNzS2VsbGV5MjAxNyAtIEJldHdlZW4gYSBSb2NrIGFuZCBhIEhhcmQgUGxhY2UgSW50ZXJuYXRpb25hbCBOR09zIGFuZCB0aGUgRHVhbC5wZGZPEQJ8AAAAAAJ8AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAADfgQ51QkQAAf////8fSGVpc3NLZWxsZXkyMDE3IC0gI0ZGRkZGRkZGLnBkZgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/////9T5sk0AAAAAAAAAAAABAAMAAAogY3UAAAAAAAAAAAAAAAAABlBhcGVycwACAHwvOlVzZXJzOmFuZHJldzpEcm9wYm94OlJlYWRpbmdzOlBhcGVyczpIZWlzc0tlbGxleTIwMTcgLSBCZXR3ZWVuIGEgUm9jayBhbmQgYSBIYXJkIFBsYWNlIEludGVybmF0aW9uYWwgTkdPcyBhbmQgdGhlIER1YWwucGRmAA4ArABVAEgAZQBpAHMAcwBLAGUAbABsAGUAeQAyADAAMQA3ACAALQAgAEIAZQB0AHcAZQBlAG4AIABhACAAUgBvAGMAawAgAGEAbgBkACAAYQAgAEgAYQByAGQAIABQAGwAYQBjAGUAIABJAG4AdABlAHIAbgBhAHQAaQBvAG4AYQBsACAATgBHAE8AcwAgAGEAbgBkACAAdABoAGUAIABEAHUAYQBsAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgB6VXNlcnMvYW5kcmV3L0Ryb3Bib3gvUmVhZGluZ3MvUGFwZXJzL0hlaXNzS2VsbGV5MjAxNyAtIEJldHdlZW4gYSBSb2NrIGFuZCBhIEhhcmQgUGxhY2UgSW50ZXJuYXRpb25hbCBOR09zIGFuZCB0aGUgRHVhbC5wZGYAEwABLwAAFQACAA3//wAAAAgADQAaACQAgwAAAAAAAAIBAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAMD}}
```
Zotero doesn't parse that gnarly field—it needs a field named `file`—and it doesn't decode that messy string into a plain text file path, so the attached PDF won't get imported correctly.
However, thanks to Emiliano Heyns, the Better BibTeX addon will automatically convert these base64-encoded paths to plain text fields that Zotero can work with just fine. All PDFs will import automatically!
I wanted all the PDFs that Zotero would manage to have nice, predictable filenames. In BibDesk, I used this pattern:

```text
citekey - First few words of title.pdf
```

That's been fine, but it uses spaces in the file name and doesn't remove any punctuation or special characters, so it was a little tricky to work with in the terminal, in scripts, or for easy consistent searching (especially in the iPad Dropbox app when looking for a PDF to read). But because I set up that pattern in 2008, path dependency kind of locked me in and I've been unwilling to change it since.
Since I'm starting with a whole new reference manager, I figured it was time to adopt a better PDF naming system. In the ZotFile preferences, I set this pattern:

```text
{%a}-{%y}-{%t}
```

…which translates to

```text
up_to_three_last_names-year-first_few_characters_of_title.pdf
```

(see this for a list of all the possible wildcards)

…with `-` separating the three logical units (authors, year, title) and `_` separating the words within each unit (which follows Jenny Bryan's principles of file naming). In practice, the pattern looks like this:

```text
heiss_kelley-2017-between_a_rock_and_a_hard_place.pdf
```
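To make that scheme concrete, here's a tiny hypothetical R helper (not part of ZotFile; just an illustration of how the units and separators combine):

```r
# Hypothetical illustration of the {%a}-{%y}-{%t} naming scheme:
# "_" joins words within a unit, "-" joins the three units together.
make_pdf_name <- function(last_names, year, title_words) {
  authors <- paste(tolower(head(last_names, 3)), collapse = "_")
  title <- paste(tolower(title_words), collapse = "_")
  paste0(authors, "-", year, "-", title, ".pdf")
}

make_pdf_name(c("Heiss", "Kelley"), 2017,
              c("between", "a", "rock", "and", "a", "hard", "place"))
#> "heiss_kelley-2017-between_a_rock_and_a_hard_place.pdf"
```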
I had to tweak a few other renaming settings too. Here’s the final set of preferences:
I wanted to switch the roles of `-` and `_` and do

```text
heiss-kelley_2017_between-a-rock-and-a-hard-place.pdf
```

…but Zotero and/or ZotFile seems to hardwire `_` as the space replacement in its titles. Oh well.
In BibDesk, I've had a citation key pattern that I've used for years: `Lastname:Year`, with up to three last names for coauthored things, and an incremental lowercase letter in the case of duplicates:

```text
HeissKelley:2017
HeissKelley:2017a
Imbens:2021
LundbergJohnsonStewart:2021
```
Zotero and Better BibTeX preserve citekeys when you import a `.bib` file, but I wanted to make sure I keep using this system for new items I add going forward, so I changed the Better BibTeX preferences to use the same pattern:

```text
auth(0,1) + auth(0,2) + auth(0,3) + ":" + year
```
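As a rough illustration of what that formula produces, here's a hypothetical R equivalent (not code Better BibTeX actually runs, and the duplicate-letter suffix logic is omitted):

```r
# Hypothetical illustration of the Lastname:Year citekey pattern
# (up to three last names; the "a", "b", ... duplicate suffixes
# that Better BibTeX adds are not handled here)
make_citekey <- function(last_names, year) {
  paste0(paste(head(last_names, 3), collapse = ""), ":", year)
}

make_citekey(c("Heiss", "Kelley"), 2017)
#> "HeissKelley:2017"
make_citekey(c("Lundberg", "Johnson", "Stewart"), 2021)
#> "LundbergJohnsonStewart:2021"
```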
With all that initial prep work done, I imported the `.bib` file into my Zotero library (File > Import…). I made sure "Place imported collections and items into new collection" was checked and that files were copied to the Zotero storage folder:
The Tags panel in Zotero then showed all the project- and class-specific keywords from BibDesk, in addition to the rating and read status tags I added previously:
I renamed each of the `zzz*` rating tags to use emoji stars and renamed the `xxx` read tag to use ✅.
Zotero has the ability to assign tags specific colors and pin them in a specific order, which also makes those tags display in the main Zotero library list. Following advice from the Zotero Tag extension, I pinned the read status ✅ tag as the first tag, the 5-star rating as the second tag, the 4-star rating as the third tag, and so on.
Now the read status and ratings tags are easily accessible and appear directly in the main Zotero library list!
### `incollection`/`inbook` and `crossref`

BibDesk natively supports the `crossref` field, which biber and biblatex use when working with LaTeX. This field lets you set up child/parent relationships between items, where children inherit fields from their parents. For instance, consider these two items—an edited book with lots of chapters from different authors, and a chapter from that book:
```bibtex
@inbook{ElHusseiniToeplerSalamon:2004,
  author = {Hashem ElHusseini and Stefan Toepler and Lester M. Salamon},
  chapter = {12},
  crossref = {SalamonSokolowski:2004},
  pages = {227--32},
  title = {Lebanon}}

@book{SalamonSokolowski:2004,
  address = {Bloomfield, CT},
  editor = {Lester M. Salamon and S. Wojciech Sokolowski},
  publisher = {Kumarian Press},
  title = {Global Civil Society: Dimensions of the Nonprofit Sector},
  volume = {2},
  year = {2004}}
```
In BibDesk, the chapter displays like this:

Fields like book title, publisher, year, etc. are all greyed out because they're inherited from the parent book with the citekey `SalamonSokolowski:2004`.
If you install version 6.7.47+ of the Better BibTeX addon, the chapter will inherit all the information from its parent book—the book title, date, publisher, etc., will all be imported correctly:
And with that, I have a complete version of my 15-year-old `references.bib` file inside Zotero!
Part of the reason I've been hesitant to switch away from BibDesk for so long is because I couldn't figure out a way to connect a Markdown document to my Zotero database. With documents that get parsed through pandoc (like R Markdown or Quarto), you add a line in the YAML front matter to specify what file contains your references:

```yaml
---
title: Whatever
author: Whoever
bibliography: references.bib
---
```
Since Zotero keeps everything in one big database, I didn't see a way to add something like `bibliography: My Zotero Database` to the YAML front matter—pandoc requires that you point to a plain text file like `.bib`, `.json`, or `.yml`, not a Zotero database.
However, the magical Better BibTeX addon clarified everything for me and makes it super easy to point pandoc at a single file that contains a collection of reference items.
### Create an auto-updating `.bib` file

First, create a collection of items that you want to cite in your writing project. Since collections are like playlists and items can belong to multiple collections, there's no need to manage duplicate entries or anything (like I was running into with Problem 2 above).
Right-click on the collection name and choose "Export collection…". Change the format to "Better BibLaTeX", check "Keep updated", and choose a place to save the resulting `.bib` file.

The "Keep updated" option is the magical part of this whole thing. If you add an item or edit an existing item in the collection in Zotero, Better BibTeX will automatically re-export the collection to the `.bib` file. You can have one central repository of citations and lots of dynamically updated plain text `.bib` files that you don't have to edit or keep track of. Truly magical.
### Point your `.qmd`/`.Rmd`/`.md` at the exported file

You'll now have a `.bib` file that contains all the references you can cite. Put that filename in your front matter (use `.json` or `.yml` if you export the file as JSON or YAML instead):

```yaml
---
title: Whatever
author: Whoever
bibliography: name_of_file_you_exported_from_zotero.bib
---
```
Cite things like normal.
Because the front matter points at a plain text `.bib` file that contains all the bibliographic references, pandoc will generate the citations correctly. And because Better BibTeX is configured to automatically update the exported plain text file, any changes you make in Zotero will automatically be reflected. Again, this is magic.
Alternatively, if you write in RStudio, you can connect RStudio to your Zotero database and have it do a similar auto-export. You can also tell it to use Better BibTeX to keep things automatically synced:
(See here for more details about Zotero citations in RStudio)
One extra nice thing about using RStudio is its fancy Insert Citation dialog, which makes adding citations in Markdown just like adding citations in Word or Google Docs. It only works in the visual Markdown editor, though, which I don't normally use, so I just use Better BibTeX on its own rather than RStudio's Zotero connection when I write in RStudio.
…the `trans_breaks()` and `trans_format()` functions used there are superseded and deprecated).
So here's a quick overview of how to use 2022-era {scales} to adjust axis breaks and labels to use both base-10 logs and natural logs. I'll use data from the Gapminder project, since it has a nicely exponentially distributed measure of GDP per capita.

The distribution of GDP per capita is heavily skewed, with most countries reporting less than $10,000. As a result, the scatterplot makes an upside-down L shape. Try sticking a regression line on that and you'll get in trouble.
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita",
       subtitle = "Original non-logged values")
```
ggplot comes with a built-in `scale_x_log10()` to transform the x-axis into logged values. It will automatically create pretty, logical breaks based on the data. Here, the breaks automatically go from 300 → 1000 → 3000 → 10000, and so on:
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10()") +
  theme(panel.grid.minor = element_blank())
```
If we want to be mathy about the labels, we can format them as base-10 exponents using `label_log()`:
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = label_log(digits = 2)) +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10() with exponentiated labels") +
  theme(panel.grid.minor = element_blank())
```
What if we don't want the default 300, 1000, 3000, etc. breaks? In the interactive plot at gapminder.org, the breaks start at 500 and double after that: 500, 1000, 2000, 4000, 8000, etc. We can control our axis breaks by feeding a list of numbers to the `breaks` argument of `scale_x_log10()`. Instead of typing out every possible break, we can generate a sequence starting at 500 and doubling each time (500 × 2^0, 500 × 2^1, 500 × 2^2, and so on):
```r
500 * 2^seq(0, 8, by = 1)
## [1]    500   1000   2000   4000   8000  16000  32000  64000 128000
```
For bonus fun, we'll format the breaks as dollars and use `cut_short_scale()` (new as of {scales} 1.2.0) to shorten the values:
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(breaks = 500 * 2^seq(0, 9, by = 1),
                labels = label_dollar(scale_cut = cut_short_scale())) +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10() + more logical breaks") +
  theme(panel.grid.minor = element_blank())
```
Log base 10 makes sense for visualizing things. Seeing the jumps from $500 → $1000 → $2000 is generally easy for people to understand (especially in today's world of exponentially growing COVID cases). When working with logged values for statistical modeling, though, analysts prefer the natural log, or log base *e*, instead.
The default logging function in R, `log()`, calculates the natural log (you have to use `log10()` or `log(x, base = 10)` to get base-10 logs).
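A quick sanity check of those functions in the console:

```r
log(100)             # natural log: 4.60517
log10(100)           # base-10 log: 2
log(100, base = 10)  # also 2
exp(log(100))        # exp() inverts the natural log: back to 100
```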
Plotting natural-logged values is a little trickier than base-10 values, since ggplot doesn't have anything like `scale_x_log_e()`. But it's still doable.

First, we can log the values on our own and just use the default `scale_x_continuous()` for labeling:
```r
ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually")
```
Those 6, 7, 8, etc. breaks on the x-axis represent the power *e* is raised to, like e^6 and e^7. We can format these labels as exponents to make that clearer:
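For reference, those breaks translate back to dollar values like this:

```r
exp(6)  # ~403
exp(7)  # ~1097
exp(8)  # ~2981
```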
```r
ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(labels = label_math(e^.x)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually, exponentiated labels")
```
To get these labels, we had to pre-log GDP per capita. We didn't need to pre-log the variable when using `scale_x_log10()`, since that logs things for us. We can have the `scale_x_*()` function handle the natural logging for us too by specifying `trans = log_trans()`:
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans()) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans()")
```
Everything is logged as expected, but those labels are gross—they're e^7, e^9, and e^11, but written out on the dollar scale:
We can format these breaks as *e*-based exponents instead with `label_math()` (with the `format = log` argument to make the formatting function log the values first):
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     # This breaks_log() thing happens behind the scenes and
                     # isn't strictly necessary here
                     # breaks = breaks_log(base = exp(1)),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels")
```
If we want more breaks than just 7, 9, and 11, we can feed the scaling function a list of exponentiated breaks:
```r
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     breaks = exp(6:11),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels, custom breaks")
```
Take, for instance, the term "fixed effects." In econometrics and other social science-flavored statistics, this typically refers to categorical terms in a regression model. Like, if we run a model like this with gapminder data…
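The kind of model meant here might be sketched like this (a sketch using gapminder variable names, not necessarily the exact specification):

```r
library(gapminder)

# Country fixed effects in the econometrics sense: country enters as a
# categorical term, so each country gets its own intercept shift
model_fe <- lm(lifeExp ~ gdpPercap + country, data = gapminder)
```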
…we can say that we’ve added “country fixed effects.”
That’s all fine and good until we come to the world of hierarchical or multilevel models, which has its own issues with nomenclature and can’t decide what to even call itself:
If we fit a model like this with country-based offsets to the intercept…
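Such a model might look like this (again a sketch assuming gapminder variable names and the {lme4} package):

```r
library(lme4)
library(gapminder)

# Country-based offsets to the intercept: each country gets a random
# intercept drawn from a shared normal distribution
model_ml <- lmer(lifeExp ~ gdpPercap + (1 | country), data = gapminder)
```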
…then we get to say that there are "country random effects" or "country group effects," while `gdpPercap` is actually a "fixed effect" or "population-level effect." "Fixed effects" in multilevel models aren't at all the same as "fixed effects" in econometrics-land.
Wild.
Another confusing term is the idea of “marginal effects.” One common definition of marginal effects is that they are slopes, or as the {marginaleffects} vignette says…
> …partial derivatives of the regression equation with respect to each variable in the model for each unit in the data.
There's a whole R package ({marginaleffects}) dedicated to calculating these, and I have a whole big long guide about this. Basically, marginal effects are the change in the outcome of a regression model when you move one of the explanatory variables up a little while holding all other covariates constant.

But there's also another definition, (seemingly?) unrelated to the idea of partial derivatives or slopes! And once again, it's a key part of the multilevel model world. I've run into it many times when reading about multilevel models (and I've even kind of alluded to it in past blog posts like this one), but I've never fully understood what multilevel marginal effects are and how they differ from slope-based marginal effects.
In multilevel models, you can calculate both marginal effects and conditional effects. Neither is necessarily related to slopes (though both can be). They're often mixed up. Even {brms} used to have a function named `marginal_effects()` that has since been renamed `conditional_effects()`.
I’m not alone in my inability to remember the difference between marginal and conditional effects in multilevel models, it seems. Everyone mixes these up. TJ Mahr recently tweeted about the confusion:
TJ studies language development in children and often works with data containing repeated child subjects. His typical models might look something like this, with observations grouped by child:

```r
tj_model <- lmer(y ~ x1 + x2 + (1 | child),
                 data = whatever)
```
His data has child-based clusters, since individual children have repeated observations over time. Given this type of multilevel model, we can find two different kinds of effects: the effect of `x1` or `x2` in one typical child, or the effect of `x1` or `x2` across all children on average. The confusingly named terms "conditional effect" and "marginal effect" refer to these two flavors of effect: conditional effects describe a typical child, while marginal effects average across all children.
If we have country random effects like `(1 | country)`, as I do in my own work, we can calculate the same two kinds of effects. Imagine a multilevel model like this:
Or more formally,
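A plausible version of that formal specification is a random-intercept model along these lines (the exact notation and subscripts here are assumptions, reconstructed from the surrounding description):

```latex
\begin{aligned}
\text{lifeExp}_{it} &= (\beta_0 + b_{0_j}) + \beta_1\, \text{gdpPercap}_{it} + \epsilon_{it} \\
b_{0_j} &\sim \mathcal{N}(0, \sigma^2_{\text{country}})
\end{aligned}
```

where \(j\) indexes countries, so each country gets its own offset \(b_{0_j}\) to the global intercept.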
With this model, we can look at two different types of effects:
- The effect of gdpPercap (\(\beta_1\)) in an average or typical country (where the random country offset is 0)
- The effect of gdpPercap (\(\beta_1\) again) across all countries (where the random country offset is dealt with… somehow…)

This conditional vs. marginal distinction applies to any sort of hierarchical structure in multilevel models:
Calculating these different effects can be tricky, even with OLS-like normal or Gaussian regression, and interpreting them can get extra complicated with generalized linear mixed models (GLMMs) where we use links like Poisson, negative binomial, logistic, or lognormal families. The math with GLMMs gets complicated—particularly with lognormal models. Kristoffer Magnusson has several incredible blog posts that explore the exact math behind each of these effects in a lognormal GLMM.
Vincent Arel-Bundock’s magisterial {marginaleffects} R package can calculate both conditional and marginal effects automatically. I accidentally stumbled across the idea of multilevel marginal and conditional effects in an earlier blog post, but there I did everything with {emmeans} rather than {marginaleffects}, and as I explore here, {marginaleffects} is great for calculating average marginal effects (AMEs) rather than marginal effects at the mean (MEMs). Also in that earlier guide, I don’t really use this “conditional” vs. “marginal” distinction and just end up calling everything marginal. So everything here is more in line with the seemingly standard multilevel model ideas of “conditional” and “marginal” effects.
Let’s load some libraries, use some neat colors and a nice ggplot theme, and get started.
library(tidyverse)
library(brms)
library(tidybayes)
library(marginaleffects)
library(broom.mixed)
library(kableExtra)
library(scales)
library(ggtext)
library(patchwork)
# Southern Utah colors
clrs <- NatParksPalettes::natparks.pals("BryceCanyon")
# Custom ggplot themes to make pretty plots
# Get Noto Sans at https://fonts.google.com/specimen/Noto+Sans
theme_nice <- function() {
  theme_bw(base_family = "Noto Sans") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          plot.title = element_text(face = "bold"),
          strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          legend.title = element_text(face = "bold"))
}
To make sure I’ve translated Magnusson’s math into the corresponding (and correct) {marginaleffects} syntax, I recreate his analysis here. He imagines some sort of intervention or treatment that is designed to reduce the amount of dollars lost in gambling each week. The individuals in this situation are grouped into some sort of clusters—perhaps neighborhoods, states, or countries, or even the same individuals over time if we have repeated longitudinal observations. The exact kind of cluster doesn’t matter here—all that matters is that observations are nested in groups, and those groups have their own specific characteristics that influence individual-level outcomes. In this simulated data, there are 20 clusters, with 30 individuals in each cluster, for 600 total observations.
To be more formal about the structure, we can say that every outcome gets two subscripts for the cluster (\(j\)) and person inside each cluster (\(i\)). We thus have \(y_{i_j}\), where \(j \in \{1, \dots, 20\}\) and \(i \in \{1, \dots, 30\}\). The nested, hierarchical, multilevel nature of the data makes the structure look something like this:
I’ve included Magnusson’s original code for generating this data here, but you can also download an .rds version of it here, or use the URL directly with readr::read_rds():
d <- readr::read_rds("https://www.andrewheiss.com/blog/2022/11/29/conditional-marginal-marginaleffects/df_example_lognormal.rds")
#' Generate lognormal data with a random intercept
#'
#' @param n1 patients per cluster
#' @param n2 clusters per treatment
#' @param B0 log intercept
#' @param B1 log treatment effect
#' @param sd_log log sd
#' @param u0 SD of log intercepts (random intercept)
#'
#' @return a data.frame
gen_data <- function(n1, n2, B0, B1, sd_log, u0) {
  cluster <- rep(1:(2 * n2), each = n1)
  TX <- rep(c(0, 1), each = n1 * n2)
  u0 <- rnorm(2 * n2, sd = u0)[cluster]
  mulog <- (B0 + B1 * TX + u0)
  y <- rlnorm(2 * n1 * n2, meanlog = mulog, sdlog = sd_log)
  d <- data.frame(cluster,
                  TX,
                  y)
  d
}
set.seed(4445)
pars <- list("n1" = 30,  # observations per cluster
             "n2" = 10,  # clusters per treatment
             "B0" = log(500),
             "B1" = log(0.5),
             "sd_log" = 0.5,
             "u0" = 0.5)

d <- do.call(gen_data,
             pars)
The model of the effect of TX on gambling losses for individuals nested in clusters can be written formally like this, with cluster-specific offsets to the intercept term (i.e. \(b_{0_j}\), or cluster random effects):
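Using the parameter names from the simulation code above, it looks something like:

$$
\begin{aligned}
Y_{i_j} &\sim \operatorname{Lognormal}(\mu_{i_j}, \sigma_y) \\
\mu_{i_j} &= (\beta_0 + b_{0_j}) + \beta_1\, \text{TX}_{i_j} \\
b_{0_j} &\sim \mathcal{N}(0, \sigma_0)
\end{aligned}
$$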
We can fit this model with {brms} (or lme4::lmer() if you don’t want to be Bayesian):
fit
## Family: lognormal
## Links: mu = identity; sigma = identity
## Formula: y ~ 1 + TX + (1  cluster)
## Data: dat (Number of observations: 600)
## Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
## total postwarmup draws = 16000
##
## GroupLevel Effects:
## ~cluster (Number of levels: 20)
## Estimate Est.Error l95% CI u95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.63 0.12 0.45 0.92 1.00 2024 3522
##
## PopulationLevel Effects:
## Estimate Est.Error l95% CI u95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 6.21 0.20 5.81 6.62 1.00 2052 3057
## TX -0.70 0.29 -1.28 -0.13 1.00 2014 2843
##
## Family Specific Parameters:
## Estimate Est.Error l95% CI u95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.51 0.01 0.48 0.54 1.00 7316 8256
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
There are four parameters that we care about in that huge wall of text. We’ll pull them out as standalone objects (using TJ Mahr’s neat model-to-list trick) and show them in a table so we can keep track of everything more easily.
fit %>%
tidy() %>%
mutate(Parameter = c("\\(\\beta_0\\)", "\\(\\beta_1\\)",
"\\(\\sigma_0\\)", "\\(\\sigma_y\\)")) %>%
mutate(Description = c("Global average gambling losses across all individuals",
"Effect of treatment on gambling losses for all individuals",
"Betweencluster variability of average gambling losses",
"Withincluster variability of gambling losses")) %>%
mutate(term = glue::glue("<code>{term}</code>"),
estimate = round(estimate, 3)) %>%
select(Parameter, Term = term, Description, Estimate = estimate) %>%
kbl(escape = FALSE) %>%
kable_styling(full_width = FALSE)
| Parameter | Term | Description | Estimate |
|---|---|---|---|
| \(\beta_0\) | (Intercept) | Global average gambling losses across all individuals | 6.210 |
| \(\beta_1\) | TX | Effect of treatment on gambling losses for all individuals | -0.702 |
| \(\sigma_0\) | sd__(Intercept) | Between-cluster variability of average gambling losses | 0.635 |
| \(\sigma_y\) | sd__Observation | Within-cluster variability of gambling losses | 0.507 |
There are a few problems with these estimates though: (1) they’re on the log scale, which isn’t very interpretable, and (2) neither the intercept term nor the TX term incorporates any details about the cluster-level effects beyond the extra information we get through partial pooling. So our goal here is to transform these estimates into something interpretable that also incorporates group-level information.
Conditional effects refer to the effect of a variable in a typical group—country, cluster, school, subject, or whatever else is in the (1 | group) term in the model. “Typical” here means that the random offset is set to zero, or that there are no random effects involved.
The average outcome across the possible values of TX for a typical cluster is formally defined as
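That is, something like:

$$
\textbf{E}(Y_{i_j} \mid b_{0_j} = 0,\ \text{TX} = \{0, 1\})
$$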
Exactly how you calculate this mathematically depends on the distribution family. For a lognormal distribution, it is this:
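Using the lognormal mean formula \(e^{\mu + \sigma^2 / 2}\), that should work out to:

$$
\textbf{E}(Y_{i_j} \mid b_{0_j} = 0,\ \text{TX}) = \exp \left( \beta_0 + \beta_1\, \text{TX} + \frac{\sigma_y^2}{2} \right)
$$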
We can calculate this automatically with marginaleffects::predictions() by setting re_formula = NA to ignore all random effects, or to set all the random offsets to zero:
predictions(
fit,
newdata = datagrid(TX = c(0, 1)),
by = "TX",
re_formula = NA
)
## # A tibble: 2 × 6
## rowid type TX predicted conf.low conf.high
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 response 0 566. 379. 857.
## 2 2 response 1 281. 188. 425.
Because we’re working with Bayesian posteriors, we might as well do neat stuff with them instead of just collapsing them down to single-number point estimates. The posteriordraws() function in {marginaleffects} lets us extract the modified/calculated MCMC draws, and then we can plot them with {tidybayes} / {ggdist}:
conditional_preds <- predictions(
fit,
newdata = datagrid(TX = c(0, 1)),
by = "TX",
re_formula = NA
) %>%
posteriordraws()
p_conditional_preds <- conditional_preds %>%
ggplot(aes(x = draw, fill = factor(TX))) +
stat_halfeye() +
scale_fill_manual(values = c(clrs[5], clrs[1])) +
scale_x_continuous(labels = label_dollar()) +
labs(x = "Gambling losses", y = "Density", fill = "TX",
title = "Conditional clusterspecific means",
subtitle = "Typical cluster where *b*<sub>0<sub>j</sub></sub> = 0") +
coord_cartesian(xlim = c(100, 1000)) +
theme_nice() +
theme(plot.subtitle = element_markdown())
p_conditional_preds
Neat.
The average treatment effect (ATE) for a binary treatment is the difference between the two averages when \(\text{TX} = 1\) and \(\text{TX} = 0\):
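Or in expectation notation, something like:

$$
\textbf{E}(Y_{i_j} \mid b_{0_j} = 0,\ \text{TX} = 1) - \textbf{E}(Y_{i_j} \mid b_{0_j} = 0,\ \text{TX} = 0)
$$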
For a lognormal family, it’s this:
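Plugging in the lognormal means from before, that difference should be:

$$
\exp \left( \beta_0 + \beta_1 + \frac{\sigma_y^2}{2} \right) - \exp \left( \beta_0 + \frac{\sigma_y^2}{2} \right)
$$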
We can again calculate this by setting re_formula = NA in marginaleffects::comparisons():
# Clusterspecific average treatment effect (when offset is 0)
comparisons(
fit,
variables = "TX",
re_formula = NA
) %>%
tidy()
## # A tibble: 1 × 6
## type term contrast estimate conf.low conf.high
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 response TX 1 - 0 -282. -590. -51.3
And here’s what the posterior of that conditional ATE looks like:
conditional_ate <- comparisons(
fit,
variables = "TX",
re_formula = NA
) %>%
posteriordraws()
p_conditional_ate <- conditional_ate %>%
ggplot(aes(x = draw)) +
stat_halfeye(fill = clrs[3]) +
scale_x_continuous(labels = label_dollar(style_negative = "minus")) +
labs(x = "(TX = 1) − (TX = 0)", y = "Density",
title = "Conditional clusterspecific ATE",
subtitle = "Typical cluster where *b*<sub>0<sub>j</sub></sub> = 0") +
coord_cartesian(xlim = c(-900, 300)) +
theme_nice() +
theme(plot.subtitle = element_markdown())
p_conditional_ate
Marginal effects refer to the global or population-level effect of a variable. In multilevel models, coefficients can have random group-specific offsets to a global mean. That’s the \(b_{0_j}\) part of the \(\beta_0 + b_{0_j}\) term in the formal model we defined earlier:
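As a reminder, the offsets enter the linear predictor like so:

$$
\mu_{i_j} = (\beta_0 + b_{0_j}) + \beta_1\, \text{TX}_{i_j}, \qquad b_{0_j} \sim \mathcal{N}(0, \sigma_0)
$$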
By definition, these offsets are distributed normally with a mean of 0 and a standard deviation of \(\sigma_0\), or sd__(Intercept) in {brms} output. We can visualize these cluster-specific offsets to get a better feel for how they work:
fit %>%
  linpred_draws(tibble(cluster = unique(d$cluster),
                       TX = 0)) %>%
  mutate(offset = B0 - .linpred) %>%
  ungroup() %>%
  mutate(cluster = fct_reorder(factor(cluster), offset, .fun = mean)) %>%
  ggplot(aes(x = offset, y = cluster)) +
  geom_vline(xintercept = 0, color = clrs[2]) +
  stat_pointinterval(color = clrs[4]) +
  labs(x = "*b*<sub>0</sub> offset from β<sub>0</sub>") +
  theme_nice() +
  theme(axis.title.x = element_markdown())
The intercept for Cluster 1 here is basically the same as the global \(\beta_0\) coefficient; Cluster 19 has a big positive offset, while Cluster 11 has a big negative offset.
The model parameters show the whole range of possible cluster-specific intercepts, or \(\operatorname{Normal}(\beta_0, \sigma_0^2)\):
ggplot() +
stat_function(fun = ~dnorm(., mean = B0, sd = sigma_0^2),
geom = "area", fill = clrs[4]) +
xlim(4, 8) +
labs(x = "Possible clusterspecific intercepts", y = "Density",
title = glue::glue("Normal(µ = {round(B0, 3)}, σ = {round(sigma_0, 3)}<sup>2</sup>)")) +
theme_nice() +
theme(plot.title = element_markdown())
When generating populationlevel estimates, then, we need to somehow incorporate this range of possible clusterspecific intercepts into the populationlevel predictions. We can do this a couple different ways: we can (1) average, marginalize or integrate across them, or (2) integrate them out.
The average outcome across the possible values of TX for all clusters together is formally defined as
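That is, something like:

$$
\textbf{E}(Y_{i_j} \mid \text{TX} = \{0, 1\})
$$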
As with the conditional effects, the equation for calculating this depends on the family you’re using. For lognormal families, it’s this incredibly scary formula:
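Roughly, it averages the cluster-specific lognormal mean over the whole \(\mathcal{N}(0, \sigma_0)\) distribution of offsets, which (thanks to the normal moment-generating function) collapses to a closed form:

$$
\textbf{E}(Y_{i_j} \mid \text{TX}) = \int_{-\infty}^{\infty} \exp \left( \beta_0 + \beta_1\, \text{TX} + u + \frac{\sigma_y^2}{2} \right) \frac{1}{\sigma_0 \sqrt{2\pi}}\, e^{-u^2 / (2 \sigma_0^2)}\, du = \exp \left( \beta_0 + \beta_1\, \text{TX} + \frac{\sigma_0^2 + \sigma_y^2}{2} \right)
$$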
Wild. This is a mess because it integrates over the normally distributed cluster-specific offsets, thus incorporating them all into the overall effect.
We can calculate this integral in a few different ways. Kristoffer Magnusson shows three different ways to calculate this hairy integral in his original post:
1. Numeric integration with integrate():
2. A magical moment-generating function for the lognormal distribution:
exp(B_TXs + (sigma_0^2 + sigma_y^2)/2)
## 0 1
## 692 343
3. Brute force Monte Carlo integration, where we create a bunch of hypothetical cluster offsets with a mean of 0 and a standard deviation of \(\sigma_0\), calculate the average outcome, then take the average of all those hypothetical clusters:
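Those last two approaches can be sanity-checked numerically. Here’s a quick sketch (in Python rather than this post’s R, purely for illustration) comparing brute-force Monte Carlo against the moment-generating-function shortcut, using rounded parameter estimates from the model output above:

```python
import math
import random

# Rounded posterior means from the {brms} output above (assumed values)
B0, B1 = 6.21, -0.70           # beta_0 and beta_1, on the log scale
sigma_0, sigma_y = 0.63, 0.51  # between- and within-cluster SDs

def marginal_mean_mc(tx, n=100_000, seed=4445):
    """Brute-force Monte Carlo: draw many hypothetical cluster offsets
    u0 ~ Normal(0, sigma_0), compute each cluster's lognormal mean,
    then average across all the hypothetical clusters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u0 = rng.gauss(0, sigma_0)
        total += math.exp(B0 + B1 * tx + u0 + sigma_y**2 / 2)
    return total / n

def marginal_mean_mgf(tx):
    """Closed form via the normal moment-generating function:
    E[exp(u0)] = exp(sigma_0^2 / 2)."""
    return math.exp(B0 + B1 * tx + (sigma_0**2 + sigma_y**2) / 2)
```

Both functions should agree to within Monte Carlo error, landing near the ~692 and ~343 values from the R output above.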
Those approaches are all great, but the math can get really complicated if there are interaction terms or splines, or if you have more complex random effects structures (random slope offsets! nested groups!). So instead we can use {marginaleffects} to handle all that complexity for us.
Average / marginalize / integrate across existing random effects: Here we calculate predictions for TX within each of the existing clusters. We then collapse them into averages for each level of TX. The values here are not identical to what we found with the earlier approaches, though they’re in the same general area. I’m not 100% sure why—I’m guessing it’s because there aren’t a lot of clusters to work with, so the averages aren’t really stable.
predictions(
fit,
newdata = datagrid(TX = c(0, 1),
cluster = unique),
by = "TX",
re_formula = NULL
)
## # A tibble: 2 × 5
## type TX predicted conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 response 0 647. 502. 905.
## 2 response 1 321. 250. 443.
We can visualize the posteriors too:
marginal_preds <- predictions(
fit,
newdata = datagrid(TX = c(0, 1),
cluster = unique),
by = "TX",
re_formula = NULL
) %>%
posteriordraws()
p_marginal_preds <- marginal_preds %>%
ggplot(aes(x = draw, fill = factor(TX))) +
stat_halfeye() +
scale_fill_manual(values = colorspace::lighten(c(clrs[5], clrs[1]), 0.4)) +
scale_x_continuous(labels = label_dollar()) +
labs(x = "Gambling losses", y = "Density", fill = "TX",
title = "Marginal populationlevel means",
subtitle = "Random effects averaged / marginalized / integrated") +
coord_cartesian(xlim = c(100, 1500)) +
theme_nice()
p_marginal_preds
Integrate out random effects: Instead of using the existing cluster intercepts, we can integrate out the random effects by generating predictions for a bunch of clusters (like 100), and then collapse those predictions into averages. This is similar to the intuition of brute force Monte Carlo integration in approach #3 earlier. This takes a long time! It results in the same estimates we found with the mathematical approaches in #1, #2, and #3 earlier.
predictions(fit, newdata = datagrid(TX = c(0, 1), cluster = c(1:100)),
allow_new_levels = TRUE,
sample_new_levels = "gaussian",
by = "TX")
## # A tibble: 2 × 5
## type TX predicted conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 response 0 682. 461. 1168.
## 2 response 1 340. 227. 577.
marginal_preds_int <- predictions(
fit,
newdata = datagrid(TX = c(0, 1),
cluster = c(1:100)),
re_formula = NULL,
allow_new_levels = TRUE,
sample_new_levels = "gaussian",
by = "TX"
) %>%
posteriordraws()
p_marginal_preds_int <- marginal_preds_int %>%
ggplot(aes(x = draw, fill = factor(TX))) +
stat_halfeye() +
scale_fill_manual(values = colorspace::lighten(c(clrs[5], clrs[1]), 0.4)) +
scale_x_continuous(labels = label_dollar()) +
labs(x = "Gambling losses", y = "Density", fill = "TX",
title = "Marginal populationlevel means",
subtitle = "Random effects integrated out") +
coord_cartesian(xlim = c(100, 1500)) +
theme_nice()
p_marginal_preds_int
The average treatment effect (ATE) for a binary treatment is the difference between the two averages when \(\text{TX} = 1\) and \(\text{TX} = 0\), after somehow incorporating all the random cluster-specific offsets:
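Something like:

$$
\textbf{E}(Y_{i_j} \mid \text{TX} = 1) - \textbf{E}(Y_{i_j} \mid \text{TX} = 0)
$$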
For a lognormal family, it’s this terrifying thing:
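With the moment-generating-function result from before, that should reduce to:

$$
\exp \left( \beta_0 + \beta_1 + \frac{\sigma_0^2 + \sigma_y^2}{2} \right) - \exp \left( \beta_0 + \frac{\sigma_0^2 + \sigma_y^2}{2} \right)
$$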
That looks scary, but really it’s just the difference in the two population-level estimates we found before: \(\textbf{E}(Y_{i_j} \mid \text{TX} = 1)\) and \(\textbf{E}(Y_{i_j} \mid \text{TX} = 0)\). We can use the same approaches from above and just subtract the two estimates, like this with the magical moment-generating function thing:
Population-level ATE with moment-generating function:
We can do this with {marginaleffects} too, either by averaging / marginalizing / integrating across existing clusters (though again, this weirdly gives slightly different results) or by integrating out the random effects from a bunch of hypothetical clusters (which gives the same result as the more analytical / mathematical estimates):
Average / marginalize / integrate across existing random effects:
# Marginal treatment effect (or global population level effect)
comparisons(
fit,
variables = "TX",
re_formula = NULL
) %>%
tidy()
## # A tibble: 1 × 6
## type term contrast estimate conf.low conf.high
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 response TX 1 - 0 -326. -652. -60.9
marginal_ate <- comparisons(
fit,
variables = "TX",
re_formula = NULL
) %>%
posteriordraws()
p_marginal_ate <- marginal_ate %>%
group_by(drawid) %>%
summarize(draw = mean(draw)) %>%
ggplot(aes(x = draw)) +
stat_halfeye(fill = colorspace::lighten(clrs[3], 0.4)) +
scale_x_continuous(labels = label_dollar(style_negative = "minus")) +
labs(x = "(TX = 1) − (TX = 0)", y = "Density",
title = "Marginal populationlevel ATE",
subtitle = "Random effects averaged / marginalized / integrated") +
coord_cartesian(xlim = c(-900, 300)) +
theme_nice()
p_marginal_ate
Integrate out random effects:
# This takes a *really* long time
comparisons(
fit,
variables = "TX",
newdata = datagrid(cluster = c(1:100)),
re_formula = NULL,
allow_new_levels = TRUE,
sample_new_levels = "gaussian"
) %>%
tidy()
## # A tibble: 1 × 6
## type term contrast estimate conf.low conf.high
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 response TX 1 - 0 -338. -779. -64.0
marginal_ate_int <- comparisons(
fit,
variables = "TX",
newdata = datagrid(cluster = c(1:100)),
re_formula = NULL,
allow_new_levels = TRUE,
sample_new_levels = "gaussian"
) %>%
posteriordraws()
p_marginal_ate_int <- marginal_ate_int %>%
group_by(drawid) %>%
summarize(draw = mean(draw)) %>%
ggplot(aes(x = draw)) +
stat_halfeye(fill = colorspace::lighten(clrs[3], 0.4)) +
scale_x_continuous(labels = label_dollar(style_negative = "minus")) +
labs(x = "(TX = 1) − (TX = 0)", y = "Density",
title = "Marginal populationlevel ATE",
subtitle = "Random effects integrated out") +
coord_cartesian(xlim = c(-900, 300)) +
theme_nice()
p_marginal_ate_int
Finally, we can work directly with the coefficients to get more slope-like effects, which is especially helpful when the coefficient of interest isn’t for a binary variable. Typically with GLMs with log or logit links (like logit, Poisson, negative binomial, lognormal, etc.) we can exponentiate the coefficient to get it as an odds ratio or a multiplicative effect. That works here too:
exp(B1)
## b_TX
## 0.495
A one-unit increase in TX causes a 51% decrease (exp(B1) - 1) in the outcome. Great.
That’s all fine here because the lognormal model doesn’t have any weird nonlinearities or interactions, but in the case of logistic regression or anything with interaction terms, life gets more complicated, so it’s better to work with marginaleffects() instead of exponentiating things by hand. If we use type = "link" we’ll keep the results on the log scale, and then we can exponentiate them. All the other random effects options that we used before (re_formula = NA, re_formula = NULL, integrating effects out, and so on) work here too.
We can visualize the oddsratioscale posterior for fun:
marginaleffects(
  fit,
  variables = "TX",
  type = "link",
  newdata = datagrid(TX = 0)
) %>%
  posteriordraws() %>%
  mutate(draw = exp(draw) - 1) %>%
ggplot(aes(x = draw)) +
stat_halfeye(fill = colorspace::darken(clrs[3], 0.4)) +
geom_vline(xintercept = 0) +
scale_x_continuous(labels = label_percent()) +
labs(x = "Percent change in outcome", y = "Density") +
theme_nice()
If we use type = "response", we can get slopes at specific values of the coefficient (which is less helpful here, since TX can only be 0 or 1; but it’s useful for continuous coefficients of interest).
Phew, that was a lot. Here’s a summary table to reference to help keep things straight.
wrap_r <- function(x) glue::glue('<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">{x}</code></pre></div>')
conditional_out <- r"{predictions(
fit,
newdata = datagrid(TX = c(0, 1)),
by = "TX",
re_formula = NA
)}"
conditional_ate <- r"{comparisons(
fit,
variables = "TX",
re_formula = NA
)}"
marginal_out <- r"{predictions(
fit,
newdata = datagrid(TX = c(0, 1),
cluster = unique),
by = "TX",
re_formula = NULL
)}"
marginal_out_int <- r"{predictions(
fit,
newdata = datagrid(TX = c(0, 1),
cluster = c(1:100)),
re_formula = NULL,
allow_new_levels = TRUE,
sample_new_levels = "gaussian",
by = "TX"
)}"
marginal_ate <- r"{comparisons(
fit,
variables = "TX",
re_formula = NULL
) %>%
tidy()
}"
marginal_ate_int <- r"{comparisons(
fit,
variables = "TX",
newdata = datagrid(cluster = c(1:100)),
re_formula = NULL,
allow_new_levels = TRUE,
sample_new_levels = "gaussian"
) %>%
tidy()
}"
tribble(
~Effect, ~Formula, ~`{marginaleffects} code`,
"Average outcomes in typical group", "\\(\\textbf{E}(Y_{i_j} \\mid b_{0_j} = 0, \\text{TX} = \\{0, 1\\})\\)", wrap_r(conditional_out),
"ATE in typical group", "\\(\\textbf{E}(Y_{i_j} \\mid b_{0_j} = 0, \\text{TX} = 1) \\)<br> \\(\\quad\\textbf{E}(Y_{i_j} \\mid b_{0_j} = 0, \\text{TX} = 0)\\)", wrap_r(conditional_ate),
"Average populationlevel outcomes (marginalized)", "\\(\\textbf{E}(Y_{i_j} \\mid \\text{TX} = \\{0, 1\\})\\)", wrap_r(marginal_out),
"Average populationlevel outcomes (integrated out)", "\\(\\textbf{E}(Y_{i_j} \\mid \\text{TX} = \\{0, 1\\})\\)", wrap_r(marginal_out_int),
"Populationlevel ATE (marginalized)", "\\(\\textbf{E}(Y_{i_j} \\mid \\text{TX} = 1) \\)<br> \\(\\quad\\textbf{E}(Y_{i_j} \\mid \\text{TX} = 0)\\)", wrap_r(marginal_ate),
"Populationlevel ATE (integrated out)", "\\(\\textbf{E}(Y_{i_j} \\mid \\text{TX} = 1) \\)<br> \\(\\quad\\textbf{E}(Y_{i_j} \\mid \\text{TX} = 0)\\)", wrap_r(marginal_ate_int)
) %>%
kbl(escape = FALSE, align = c("l", "l", "l")) %>%
kable_styling(htmltable_class = "table tablesm") %>%
pack_rows(index = c("Conditional effects" = 2, "Marginal effects" = 4)) %>%
column_spec(1, width = "25%") %>%
column_spec(2, width = "35%") %>%
column_spec(3, width = "40%")
| Effect | Formula | {marginaleffects} code |
|---|---|---|
| **Conditional effects** |  |  |
| Average outcomes in typical group | \(\textbf{E}(Y_{i_j} \mid b_{0_j} = 0, \text{TX} = \{0, 1\})\) | `predictions(fit, newdata = datagrid(TX = c(0, 1)), by = "TX", re_formula = NA)` |
| ATE in typical group | \(\textbf{E}(Y_{i_j} \mid b_{0_j} = 0, \text{TX} = 1) - \textbf{E}(Y_{i_j} \mid b_{0_j} = 0, \text{TX} = 0)\) | `comparisons(fit, variables = "TX", re_formula = NA)` |
| **Marginal effects** |  |  |
| Average population-level outcomes (marginalized) | \(\textbf{E}(Y_{i_j} \mid \text{TX} = \{0, 1\})\) | `predictions(fit, newdata = datagrid(TX = c(0, 1), cluster = unique), by = "TX", re_formula = NULL)` |
| Average population-level outcomes (integrated out) | \(\textbf{E}(Y_{i_j} \mid \text{TX} = \{0, 1\})\) | `predictions(fit, newdata = datagrid(TX = c(0, 1), cluster = c(1:100)), re_formula = NULL, allow_new_levels = TRUE, sample_new_levels = "gaussian", by = "TX")` |
| Population-level ATE (marginalized) | \(\textbf{E}(Y_{i_j} \mid \text{TX} = 1) - \textbf{E}(Y_{i_j} \mid \text{TX} = 0)\) | `comparisons(fit, variables = "TX", re_formula = NULL)` |
| Population-level ATE (integrated out) | \(\textbf{E}(Y_{i_j} \mid \text{TX} = 1) - \textbf{E}(Y_{i_j} \mid \text{TX} = 0)\) | `comparisons(fit, variables = "TX", newdata = datagrid(cluster = c(1:100)), re_formula = NULL, allow_new_levels = TRUE, sample_new_levels = "gaussian")` |
And here are all the posteriors all together, for easier comparison:
((p_conditional_preds + coord_cartesian(xlim = c(0, 1200))) | p_conditional_ate) /
((p_marginal_preds + coord_cartesian(xlim = c(0, 1200))) | p_marginal_ate) /
((p_marginal_preds_int + coord_cartesian(xlim = c(0, 1200))) | p_marginal_ate_int)
You can download PDF, SVG, and PNG versions of the diagrams and cheat sheets in this post, as well as the original Adobe Illustrator and InDesign files, at the bottom of this post.
Do whatever you want with them! They’re licensed under Creative Commons Attribution-ShareAlike (CC BY-SA 4.0).
I’ve been working with Bayesian models and the Stan-based brms ecosystem (tidybayes, ggdist, marginaleffects, and friends) for a few years now, and I’m currently finally working through formal materials on Bayesianism and running an independent readings class with a PhD student at GSU where we’re reading Richard McElreath’s Statistical Rethinking and Alicia Johnson, Miles Ott, and Mine Dogucu’s Bayes Rules!, both of which are fantastic books (check out my translation of their materials to tidyverse/brms here).
Something that has always plagued me about working with Bayesian posterior distributions, but that I’ve always waved off as too hard to think about, has been the differences between posterior predictions, the expectation of the posterior predictive distribution, and the posterior of the linear predictor (or posterior_predict(), posterior_epred(), and posterior_linpred() in the brms world). But reading these two books has forced me to finally figure it out.
So here’s an explanation of my mental model of the differences between these types of posterior distributions. It’s definitely not 100% correct, but it makes sense for me.
For bonus fun, skip down to the incredibly useful diagrams and cheat sheets at the bottom of this post.
Let’s load some packages, load some data, and get started!
library(tidyverse) # ggplot, dplyr, and friends
library(patchwork) # Combine ggplot plots
library(ggtext) # Fancier text in ggplot plots
library(scales) # Labeling functions
library(brms) # Bayesian modeling through Stan
library(tidybayes) # Manipulate Stan objects in a tidy way
library(marginaleffects) # Calculate marginal effects
library(modelr) # For quick model grids
library(extraDistr) # For dprop() beta distribution with mu/phi
library(distributional) # For plotting distributions with ggdist
library(palmerpenguins) # Penguins!
library(kableExtra) # For nicer tables
# Make random things reproducible
set.seed(1234)
# Bayes stuff
# Use the cmdstanr backend for Stan because it's faster and more modern than
# the default rstan. You need to install the cmdstanr package first
# (https://mcstan.org/cmdstanr/) and then run cmdstanr::install_cmdstan() to
# install cmdstan on your computer.
options(mc.cores = 4, # Use 4 cores
brms.backend = "cmdstanr")
bayes_seed <- 1234
# Colors from MetBrewer
clrs <- MetBrewer::met.brewer("Java")
# Custom ggplot themes to make pretty plots
# Get Roboto Condensed at https://fonts.google.com/specimen/Roboto+Condensed
# Get Roboto Mono at https://fonts.google.com/specimen/Roboto+Mono
theme_pred <- function() {
  theme_minimal(base_family = "Roboto Condensed") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          plot.title = element_text(face = "bold"),
          strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          axis.title.x = element_text(hjust = 0),
          axis.title.y = element_text(hjust = 0),
          legend.title = element_text(face = "bold"))
}
theme_pred_dist <- function() {
  theme_pred() +
    theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
          plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
          axis.text.y = element_blank(),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_blank())
}
theme_pred_range <- function() {
  theme_pred() +
    theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
          plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
          panel.grid.minor.y = element_blank())
}
update_geom_defaults("text", list(family = "Roboto Condensed", lineheight = 1))
# Add a couple new variables to the penguins data:
#  is_gentoo: Indicator for whether or not the penguin is a Gentoo
#  bill_ratio: The ratio of a penguin's bill depth (height) to its bill length
penguins <- penguins |>
  drop_na(sex) |>
  mutate(is_gentoo = species == "Gentoo") |>
  mutate(bill_ratio = bill_depth_mm / bill_length_mm)
First we’ll look at basic linear regression. Normal or Gaussian models are roughly equivalent to frequentist ordinary least squares (OLS) regression. We estimate an intercept and a slope and draw a line through the data. If we include multiple explanatory variables or predictors, we’ll have multiple slopes, or partial derivatives or marginal effects (see here for more about that). But to keep things as simple and basic and illustrative as possible, we’ll just use one explanatory variable here.
In this example, we’re interested in the relationship between penguin flipper length and penguin body mass. Do penguins with longer flippers weigh more? Here’s what the data looks like:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(size = 1, alpha = 0.7) +
geom_smooth(method = "lm", color = clrs[5], se = FALSE) +
scale_y_continuous(labels = label_comma()) +
coord_cartesian(ylim = c(2000, 6000)) +
labs(x = "Flipper length (mm)", y = "Body mass (g)") +
theme_pred()
It seems like there’s a pretty clear relationship between the two. As flipper length increases, body mass also increases.
We can create a more formal model for the distribution of body mass, conditional on different values of flipper length, like this:
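With the \(\alpha\), \(\beta\), and \(\sigma\) parameters that show up in the model output below, something like:

$$
\begin{aligned}
\text{body mass}_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta \times \text{flipper length}_i
\end{aligned}
$$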
Or more generally:
This implies that body mass follows a normal (or Gaussian) distribution with some average (\(\mu\)) and some amount of spread (\(\sigma\)), and that the \(\mu\) parameter is conditional on (or based on, or dependent on) flipper length.
Let’s run that model in Stan through brms (with all the default priors; in real life you’d want to set more official priors for the intercept \(\alpha\), the coefficient \(\beta\), and the overall model spread \(\sigma\)).
If we look at the model results, we can see the means of the posterior distributions of each of the model’s parameters (\(\alpha\), \(\beta\), and \(\sigma\)). The intercept (\(\alpha\)) is huge and negative because flipper length is far away from 0, so it’s pretty uninterpretable. The \(\beta\) coefficient shows that a one-mm increase in flipper length is associated with a 50 gram increase in body mass. And the overall model standard deviation \(\sigma\) shows that there’s roughly 400 grams of deviation around the mean body mass.
broom.mixed::tidy(model_normal) |>
  bind_cols(parameter = c("α", "β", "σ")) |>
  select(parameter, term, estimate, std.error, conf.low, conf.high)
## # A tibble: 3 × 6
## parameter term estimate std.error conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 α (Intercept) 5874. 311. 6466. 5257.
## 2 β flipper_length_mm 50.2 1.54 47.1 53.1
## 3 σ sd__Observation 394. 15.7 366. 426.
That table shows just the posterior means for each of these parameters, but these are technically all complete distributions. In this post we’re not interested in these actual values—we’re concerned with the outcome, or penguin weight here. (But you can see this post or this post or this post or this documentation for more about working with these coefficients and calculating marginal effects.)
Going back to the formal model, so far we’ve looked at \(\alpha\), \(\beta\), and \(\sigma\), but what about \(\mu\) and the overall posterior distribution of the outcome (or \(y\))? This is where life gets a little trickier (and why this guide exists in the first place!). Both \(\mu\) and the posterior for \(y\) represent penguin body mass, but conceptually they’re different things. We’ll extract these different distributions with three different brms functions: posterior_predict(), posterior_epred(), and posterior_linpred() (the code uses predicted_draws(), epred_draws(), and linpred_draws(); these are tidybayes’s wrappers for the corresponding brms functions).
Note the newdata argument here. We have to feed a data frame of values to plug into the model to make these different posterior predictions. We could feed the original dataset with newdata = penguins, which would plug each row of the data into the model and generate 4,000 posterior draws for it. Given that there are 333 rows in the penguins data, using newdata = penguins would give us 333 × 4,000 = 1,332,000 rows. That’s a ton of data, and looking at it all together like that isn’t super useful unless we look at predictions across a range of possible predictors. We’ll do that later in this section and see the posterior predictions of weights across a range of flipper lengths. But here we’re just interested in the prediction of the outcome based on a single value of flipper length. We’ll use the average (200.967 mm), but it could easily be the median or whatever arbitrary number we want.
# Make a little dataset of just the average flipper length
penguins_avg_flipper <- penguins |>
  summarize(flipper_length_mm = mean(flipper_length_mm))

# Extract different types of posteriors
normal_linpred <- model_normal |>
  linpred_draws(newdata = penguins_avg_flipper)

normal_epred <- model_normal |>
  epred_draws(newdata = penguins_avg_flipper)

normal_predicted <- model_normal |>
  predicted_draws(newdata = penguins_avg_flipper,
                  seed = 12345)  # So that the manual results with rnorm() are the same later
These each show the posterior distribution of penguin weight, and each corresponds to a different part of the formal mathematical model. We can explore these nuances by looking at these distributions’ means, medians, standard deviations, and overall shapes:
summary_normal_linpred <- normal_linpred |>
  ungroup() |>
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_epred <- normal_epred |>
  ungroup() |>
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_predicted <- normal_predicted |>
  ungroup() |>
  summarize(across(.prediction, lst(mean, sd, median), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`,
  "<code>posterior_linpred()</code>", "\\(\\mu\\) in the model",
  "<code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\mu\\) in the model",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Normal}(\\mu_i, \\sigma)\\)"
) |>
  bind_cols(bind_rows(summary_normal_linpred, summary_normal_epred, summary_normal_predicted)) |>
  kbl(escape = FALSE) |>
  kable_styling()
| Function | Model element | mean | sd | median |
|---|---|---|---|---|
| `posterior_linpred()` | \(\mu\) in the model | 4206 | 21.8 | 4207 |
| `posterior_epred()` | \(\operatorname{E(y)}\) and \(\mu\) in the model | 4206 | 21.8 | 4207 |
| `posterior_predict()` | Random draws from posterior \(\operatorname{Normal}(\mu_i, \sigma)\) | 4207 | 386.7 | 4209 |
p1 <- ggplot(normal_linpred, aes(x = .linpred)) +
  stat_halfeye(fill = clrs[3]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "posterior_linpred(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p2 <- ggplot(normal_epred, aes(x = .epred)) +
  stat_halfeye(fill = clrs[2]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *µ* in the model</span>",
       subtitle = "posterior_epred(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist()

p3 <- ggplot(normal_predicted, aes(x = .prediction)) +
  stat_halfeye(fill = clrs[1]) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(2900, 5500)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "posterior_predict(..., tibble(flipper_length_mm = 201))") +
  theme_pred_dist()

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
The most obvious difference between these different posterior predictions is the range of predictions. For posterior_linpred()
and posterior_epred()
, the standard error is tiny and the range of plausible predicted values is really narrow. For posterior_predict()
, the standard error is substantially bigger, and the corresponding range of predicted values is huge.
To understand why, let’s explore the math going on behind the scenes in these functions. Both posterior_linpred() and posterior_epred() correspond to the µ part of the model. They’re the average penguin weight as predicted by the linear model (hence linpred: linear predictor). We can see this if we plug a 201 mm flipper length into each row of the posterior and calculate mu by hand with µ = α + β × flipper length:
linpred_manual <- model_normal |>
  spread_draws(b_Intercept, b_flipper_length_mm) |>
  mutate(mu = b_Intercept +
           (b_flipper_length_mm * penguins_avg_flipper$flipper_length_mm))
linpred_manual
## # A tibble: 4,000 × 6
## .chain .iteration .draw b_Intercept b_flipper_length_mm mu
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 -6152. 51.5 4204.
## 2 1 2 2 -5872 50.2 4221.
## 3 1 3 3 -6263. 52.1 4202.
## 4 1 4 4 -6066. 51.1 4213.
## 5 1 5 5 -5740. 49.4 4191.
## 6 1 6 6 -5678. 49.2 4213.
## 7 1 7 7 -6107. 51.1 4160.
## 8 1 8 8 -5422. 48.0 4235.
## 9 1 9 9 -6303. 52.1 4177.
## 10 1 10 10 -6193. 51.6 4184.
## # … with 3,990 more rows
That mu
column is identical to what we calculate with posterior_linpred()
. Just to confirm, we can plot the two distributions:
p1_manual <- linpred_manual |>
  ggplot(aes(x = mu)) +
  stat_halfeye(fill = colorspace::lighten(clrs[3], 0.5)) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(4100, 4300)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "b_Intercept + (b_flipper_length_mm * 201)") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p1_manual | p1
Importantly, the distribution of the µ part of the model here does not incorporate any information about σ. That’s why the distribution is so narrow.
The results from posterior_predict(), on the other hand, correspond to the y part of the model. Officially, they are draws from a random normal distribution using both the estimated µ and the estimated σ. These results contain the full uncertainty of the posterior distribution of penguin weight. To help with the intuition, we can do the same thing by hand when plugging in a 201 mm flipper length:
set.seed(12345)  # To get the same results as posterior_predict() from earlier

postpred_manual <- model_normal |>
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |>
  mutate(mu = b_Intercept +
           (b_flipper_length_mm *
              penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

postpred_manual |>
  select(.draw:y_new)
## # A tibble: 4,000 × 6
## .draw b_Intercept b_flipper_length_mm sigma mu y_new
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -6152. 51.5 384. 4204. 4429.
## 2 2 -5872 50.2 401. 4221. 4506.
## 3 3 -6263. 52.1 390. 4202. 4159.
## 4 4 -6066. 51.1 409. 4213. 4027.
## 5 5 -5740. 49.4 362. 4191. 4411.
## 6 6 -5678. 49.2 393. 4213. 3499.
## 7 7 -6107. 51.1 417. 4160. 4423.
## 8 8 -5422. 48.0 351. 4235. 4138.
## 9 9 -6303. 52.1 426. 4177. 4055.
## 10 10 -6193. 51.6 426. 4184. 3793.
## # … with 3,990 more rows
That y_new column here is the y part of the model and should have a lot more uncertainty than the mu column, which is just the µ part of the model. Notably, the y_new column is the same as what we get when using posterior_predict(). We’ll plot the two distributions to confirm:
p3_manual <- postpred_manual |>
  ggplot(aes(x = y_new)) +
  stat_halfeye(fill = colorspace::lighten(clrs[1], 0.5)) +
  scale_x_continuous(labels = label_comma()) +
  coord_cartesian(xlim = c(2900, 5500)) +
  labs(x = "Body mass (g)", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "rnorm(b_Intercept + (b_flipper_length_mm * 201), sigma)") +
  theme_pred_dist() +
  theme(plot.title = element_markdown())

p3_manual | p3
The results from posterior_predict() and posterior_linpred() have the same mean, but the full posterior predictions that incorporate the estimated σ have a much wider range of plausible values.
The results from posterior_epred() are a little strange to understand, and in the case of normal/Gaussian regression (and many other types of regression models!), they’re identical to the linear predictor (posterior_linpred()). These are the posterior draws of the expected value, or mean, of the posterior distribution, or E(y) in the model. Behind the scenes, this is calculated by taking each row’s posterior predictive draw and then taking the average of all of those draws.
Once again, a quick illustration can help. As before, we’ll manually plug a flipper length of 201 mm into the posterior estimates of the intercept and slope to calculate the µ part of the model. We’ll then use that µ along with the estimated σ in rnorm() to generate the posterior predictive distribution, or the y part of the model. Finally, we’ll take the average of the y_new posterior predictive distribution to get the expectation of the posterior predictive distribution, or epred. It’s the same as what we get when using posterior_epred(); the only differences are because of randomness.
epred_manual <- model_normal |>
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |>
  mutate(mu = b_Intercept +
           (b_flipper_length_mm *
              penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

# This is posterior_epred()
epred_manual |>
  summarize(epred = mean(y_new))
## # A tibble: 1 × 1
## epred
## <dbl>
## 1 4202.
# It's essentially the same as the actual posterior_epred()
normal_epred |>
  ungroup() |>
  summarize(epred = mean(.epred))
## # A tibble: 1 × 1
## epred
## <dbl>
## 1 4206.
For mathy reasons, in Gaussian regression this expectation happens to be identical to the linear predictor µ, so the results from posterior_linpred() and posterior_epred() are identical. And—fun fact—the brms code for posterior_epred() for Gaussian models doesn’t even recalculate the average of the posterior. It just returns the linear predictor µ.
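To see this equivalence numerically without touching the model at all, here’s a toy simulation (in Python rather than R, with made-up posterior draws that only loosely mimic the numbers above—a real posterior would have correlated draws): the average of the predictive draws lands on the average of µ, while the spread does not.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_draws = 4000

# Fake, independent "posterior draws" (made-up numbers loosely
# mimicking the model above)
intercept = rng.normal(-5870, 310, n_draws)
slope = rng.normal(50.2, 1.5, n_draws)
sigma = rng.normal(394, 16, n_draws)

flipper = 201  # plug in a single flipper length

mu = intercept + slope * flipper   # analogue of posterior_linpred()
y_new = rng.normal(mu, sigma)      # analogue of posterior_predict()

# epred = average over the predictive draws; it matches the average
# of mu, but the predictive draws are far more spread out
print(mu.mean(), y_new.mean())
print(mu.std(), y_new.std())
```

The means agree up to simulation noise; only the standard deviations differ, which is exactly the pattern in the plots above.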
We can also look at these different types of posterior predictions across a range of possible flipper lengths. There’s a lot more uncertainty in the full posterior, since it incorporates the uncertainty of both µ and σ, while the uncertainty of the linear predictor/expected value of the posterior is much narrower (and equivalent in this case):
p1 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_linpred_draws(model_normal, ndraws = 100) |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .linpred), .width = 0.95,
                  alpha = 0.5, color = clrs[3], fill = clrs[3]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
       subtitle = "posterior_linpred()") +
  theme_pred_range()

p2 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_epred_draws(model_normal, ndraws = 100) |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .epred), .width = 0.95,
                  alpha = 0.5, color = clrs[2], fill = clrs[2]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *µ* in the model</span>",
       subtitle = "posterior_epred()") +
  theme_pred_range()

p3 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_predicted_draws(model_normal, ndraws = 100) |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = .prediction), .width = 0.95,
                  alpha = 0.5, color = clrs[1], fill = clrs[1]) +
  geom_point(data = penguins, aes(y = body_mass_g), size = 1, alpha = 0.7) +
  scale_y_continuous(labels = label_comma()) +
  coord_cartesian(ylim = c(2000, 6000)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
       subtitle = "posterior_predict()") +
  theme_pred_range()

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
Phew. There are a lot of moving parts here with different types of posteriors and averages and variances. Here’s a helpful diagram that shows how everything is connected and which R functions calculate which parts:
Generalized linear models (e.g., logistic, probit, ordered logistic, exponential, Poisson, negative binomial, etc.) use special link functions (e.g. logit, log, etc.) to transform the likelihood of an outcome into a scale that is more amenable to linear regression.
Estimates from these models can be used on their transformed scales (e.g., log odds in logistic regression) or can be back-transformed into their original scale (e.g., probabilities in logistic regression).
When working with links, the various Bayesian prediction functions return values on different scales, each corresponding to different parts of the model.
To show how different link functions work with posteriors from generalized linear models, we’ll use logistic regression with a single explanatory variable (again, for the sake of illustrative simplicity). We’re interested in whether a penguin’s bill length can predict if a penguin is a Gentoo or not. Here’s what the data looks like—Gentoos seem to have taller bills than their Chinstrap and Adélie counterparts.
ggplot(penguins, aes(x = bill_length_mm, y = as.numeric(is_gentoo))) +
geom_dots(aes(side = ifelse(is_gentoo, "bottom", "top")),
pch = 19, color = "grey20", scale = 0.2) +
geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit")),
color = clrs[5], se = FALSE) +
scale_y_continuous(labels = label_percent()) +
labs(x = "Bill length (mm)", y = "Probability of being a Gentoo") +
theme_pred()
We ultimately want to model that curvy line, but working with regular slopes and intercepts makes it tricky, since the data is all constrained between 0% and 100% and the line is, um, curvy. If we were economists we could just stick a straight line on that graph, call it a linear probability model, and be done. But that’s weird.
Instead, we can transform the outcome variable from 0s and 1s into logged odds or logits, which creates a nice straight line that we can use with regular old linear regression. Again, I won’t go into the details of how logistic regression works here (see this example or this tutorial or this post or this post for lots more about it).
Just know that logits (or log odds) are a transformation of probabilities (π) into a different scale using this formula:

$$\operatorname{logit}(\pi) = \log \left( \frac{\pi}{1 - \pi} \right)$$
This plot shows the relationship between the two scales. Probabilities range from 0 to 1, while logits typically range from −4 to 4ish, and a logit of 0 corresponds to a probability (π) of 0.5. There are big changes in probability between −4ish and 4ish, but once you start getting into the 5s and beyond, the probability is all essentially the same.
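The two transformations are tiny functions—in R they’re qlogis() and plogis(). Here’s a minimal sketch in Python (the helper names are mine) that shows both the logit of 0.5 sitting at 0 and how quickly the probability scale saturates:

```python
import math

def logit(p):
    """Probability -> log odds (R's qlogis())."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Log odds -> probability (R's plogis())."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))               # 0.0: 50% sits at a logit of 0
print(round(inv_logit(4), 3))   # 0.982: already near the ceiling
print(round(inv_logit(8), 3))   # 1.0: bigger logits barely move it
```

The two functions are exact inverses of each other, which is what lets posterior_epred() move back and forth between scales later on.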
tibble(x = seq(-8, 8, by = 0.1)) |>
  mutate(y = plogis(x)) |>
  ggplot(aes(x = x, y = y)) +
  geom_line(size = 1, color = clrs[4]) +
  labs(x = "Logit scale", y = "Probability scale") +
  theme_pred()
We can create a formal model for the probability of being a Gentoo following a binomial distribution with a size of 1 (i.e., the distribution contains only 0s and 1s—either the penguin is a Gentoo or it is not) and a probability π that is conditional on different values of bill length:
Or more generally,
Model time! Again, we’re using all the default priors here—in real life you’d want to set more official priors for the intercept α and the coefficient β, especially since β is on the logit scale and unlikely to ever be bigger than 3 or 4.
We could look at these coefficients and interpret their marginal effects, but here we’re more interested in the distribution of the outcome, not the coefficients (see here or here or here for examples of how to interpret logistic regression coefficients).
Let’s again extract these different posterior distributions with the three main brms functions: posterior_linpred()
, posterior_epred()
, and posterior_predict()
. We’ll look at the posterior distribution when bill_length_mm
is its average value, or 43.993:
# Make a little dataset of just the average bill length
penguins_avg_bill <- penguins |>
  summarize(bill_length_mm = mean(bill_length_mm))

# Extract different types of posteriors
logit_linpred <- model_logit |>
  linpred_draws(newdata = penguins_avg_bill)

logit_epred <- model_logit |>
  epred_draws(newdata = penguins_avg_bill)

logit_predicted <- model_logit |>
  predicted_draws(newdata = penguins_avg_bill)
These each show the posterior distribution of being a Gentoo, but unlike the Gaussian posteriors we looked at earlier, each of these is measured completely differently now!
summary_logit_linpred <- logit_linpred |>
  ungroup() |>
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_logit_epred <- logit_epred |>
  ungroup() |>
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_logit_predicted <- logit_predicted |>
  ungroup() |>
  summarize(across(.prediction, lst(mean), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`, ~Values,
  "<code>posterior_linpred()</code>", "\\(\\operatorname{logit}(\\pi)\\) in the model", "Logits or log odds",
  "<code>posterior_linpred(transform = TRUE)</code> or <code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\pi\\) in the model", "Probabilities",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Binomial}(1, \\pi)\\)", "0s and 1s"
) |>
  bind_cols(bind_rows(summary_logit_linpred, summary_logit_epred, summary_logit_predicted)) |>
  kbl(escape = FALSE) |>
  kable_styling()
| Function | Model element | Values | mean | sd | median |
|---|---|---|---|---|---|
| `posterior_linpred()` | \(\operatorname{logit}(\pi)\) in the model | Logits or log odds | −0.798 | 0.138 | −0.796 |
| `posterior_linpred(transform = TRUE)` or `posterior_epred()` | \(\operatorname{E(y)}\) and \(\pi\) in the model | Probabilities | 0.311 | 0.029 | 0.311 |
| `posterior_predict()` | Random draws from posterior \(\operatorname{Binomial}(1, \pi)\) | 0s and 1s | 0.306 | | |
p1 <- ggplot(logit_linpred, aes(x = .linpred)) +
  stat_halfeye(fill = clrs[3]) +
  coord_cartesian(xlim = c(-1.5, -0.2)) +
  labs(x = "Logit-transformed probability of being a Gentoo", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>logit(*π*) in the model</span>",
       subtitle = "posterior_linpred(..., tibble(bill_length_mm = 44))") +
  theme_pred_dist()

p2 <- ggplot(logit_epred, aes(x = .epred)) +
  stat_halfeye(fill = clrs[2]) +
  scale_x_continuous(labels = label_percent()) +
  coord_cartesian(xlim = c(0.2, 0.45)) +
  labs(x = "Probability of being a Gentoo", y = NULL,
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *π* in the model</span>",
       subtitle = "posterior_epred(..., tibble(bill_length_mm = 44))") +
  theme_pred_dist()

p3 <- logit_predicted |>
  count(is_gentoo = .prediction) |>
  mutate(prop = n / sum(n),
         prop_nice = label_percent(accuracy = 0.1)(prop)) |>
  ggplot(aes(x = factor(is_gentoo), y = n)) +
  geom_col(fill = clrs[1]) +
  geom_text(aes(label = prop_nice), nudge_y = -300, color = "white", size = 3) +
  scale_x_discrete(labels = c("Not Gentoo (0)", "Gentoo (1)")) +
  scale_y_continuous(labels = label_comma()) +
  labs(x = "Prediction of being a Gentoo", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Binomial(1, *π*)</span>",
       subtitle = "posterior_predict(..., tibble(bill_length_mm = 44))") +
  theme_pred_range() +
  theme(panel.grid.major.x = element_blank())

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
Unlike the Gaussian/normal regression from earlier, the results from posterior_epred() and posterior_linpred() are not identical here. They still both correspond to the π part of the model, but on different scales. posterior_epred() provides results on the probability scale, un-logiting and back-transforming the results from posterior_linpred() (which provides results on the logit scale).
Again, technically, posterior_epred() isn’t just the back-transformed linear predictor (if you want that, you can use posterior_linpred(..., transform = TRUE)). More formally, posterior_epred() returns the expected values of the posterior, or E(y): the average of the posterior’s averages. But as with Gaussian regression, for mathy reasons this average-of-averages happens to be the same as the back-transformed π, so E(y) = π.
The results from posterior_predict() are draws from a random binomial distribution using the estimated π, and they consist of only 0s and 1s (not-Gentoo and Gentoo).
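The fact that those 0/1 draws average back to π is easy to check with a quick simulation (Python here for illustration; the π value is the posterior mean probability from the table above):

```python
import numpy as np

rng = np.random.default_rng(1234)

pi = 0.311  # posterior mean probability of being a Gentoo at the average bill

# Analogue of posterior_predict(): one 0/1 draw per posterior draw
draws = rng.binomial(n=1, p=pi, size=4000)

print(sorted(set(draws.tolist())))  # [0, 1] -- only 0s and 1s come out

# ...but their mean recovers pi (within simulation noise), which is
# why E(y) and pi coincide for a 0/1 outcome
print(draws.mean().round(3))
```

That mean-of-draws logic is exactly why the `posterior_predict()` row in the table above has a mean of roughly 0.31 even though every individual draw is a 0 or a 1.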
Showing these posterior predictions across a range of bill lengths also helps with the intuition here and illustrates the different scales and values that these posterior functions return:
- `posterior_linpred()` returns the value of π on the logit scale
- `posterior_epred()` returns the value of π on the probability scale (technically it’s returning E(y), but in practice those are identical here)
- `posterior_predict()` returns 0s and 1s, plotted here as points at bill lengths of 35, 45, and 55 mm

pred_logit_gentoo <- tibble(bill_length_mm = c(35, 45, 55)) |>
  add_predicted_draws(model_logit, ndraws = 500)

pred_logit_gentoo_summary <- pred_logit_gentoo |>
  group_by(bill_length_mm) |>
  summarize(prop = mean(.prediction),
            prop_nice = paste0(label_percent(accuracy = 0.1)(prop), "\nGentoos"))
p1 <- penguins |>
  data_grid(bill_length_mm = seq_range(bill_length_mm, n = 100)) |>
  add_linpred_draws(model_logit, ndraws = 100) |>
  ggplot(aes(x = bill_length_mm)) +
  stat_lineribbon(aes(y = .linpred), .width = 0.95,
                  alpha = 0.5, color = clrs[3], fill = clrs[3]) +
  coord_cartesian(xlim = c(30, 60)) +
  labs(x = "Bill length (mm)", y = "Logit-transformed\nprobability of being a Gentoo",
       title = "**Linear predictor posterior** <span style='font-size: 14px;'>logit(*π*) in the model</span>",
       subtitle = "posterior_linpred()") +
  theme_pred_range()

p2 <- penguins |>
  data_grid(bill_length_mm = seq_range(bill_length_mm, n = 100)) |>
  add_epred_draws(model_logit, ndraws = 100) |>
  ggplot(aes(x = bill_length_mm)) +
  geom_dots(data = penguins, aes(y = as.numeric(is_gentoo), x = bill_length_mm,
                                 side = ifelse(is_gentoo, "bottom", "top")),
            pch = 19, color = "grey20", scale = 0.2) +
  stat_lineribbon(aes(y = .epred), .width = 0.95,
                  alpha = 0.5, color = clrs[2], fill = clrs[2]) +
  scale_y_continuous(labels = label_percent()) +
  coord_cartesian(xlim = c(30, 60)) +
  labs(x = "Bill length (mm)", y = "Probability of\nbeing a Gentoo",
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *π* in the model</span>",
       subtitle = "posterior_epred()") +
  theme_pred_range()

p3 <- ggplot(pred_logit_gentoo, aes(x = factor(bill_length_mm), y = .prediction)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1, seed = 1234),
             size = 0.75, alpha = 0.3, color = clrs[1]) +
  geom_text(data = pred_logit_gentoo_summary, aes(y = 0.5, label = prop_nice), size = 3) +
  scale_y_continuous(breaks = c(0, 1), labels = c("Not\nGentoo", "Gentoo")) +
  labs(x = "Bill length (mm)", y = "Prediction of\nbeing a Gentoo",
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Binomial(1, *π*)</span>",
       subtitle = "posterior_predict()") +
  theme_pred_range() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.text.y = element_text(angle = 90, hjust = 0.5))

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
  plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
There are a lot more moving parts here than with Gaussian regression, with different types of posteriors measured on three different scales! This diagram summarizes everything:
Regression models often focus solely on the location parameter of the model (e.g., µ in Normal(µ, σ); π in Binomial(1, π)). However, it is also possible to specify separate predictors for the scale or shape parameters of models (e.g., σ in Normal(µ, σ), φ in Beta(µ, φ)). In the world of brms, these are called distributional models.
More complex models can use a whole collection of distributional parameters. Zero-inflated beta models estimate a mean µ, precision φ, and a zero-inflation parameter zi, while hurdle lognormal models estimate a mean µ, scale σ, and a hurdle parameter hu. Even plain old Gaussian models become distributional models when a set of predictors is specified for σ (e.g., bf(y ~ x1 + x2, sigma ~ x2 + x3)).
When working with extra distributional parameters, the various Bayesian posterior prediction functions return values on different scales for each different component of the model, making life even more complex! Estimates and distributional parameters (what brms calls dpar in its functions) from these models can be used on their transformed scales or can be back-transformed into their original scales.
To show how different link functions and distributional parameters work with posteriors from distributional models, we’ll use beta regression with a single explanatory variable. The penguin data we’ve been using doesn’t have any variables that are proportions or otherwise constrained between 0 and 1, so we’ll make one up. Here we’re interested in the ratio of penguin bill depth (equivalent to the height of the bill; see this illustration) to bill length and whether flipper length influences that ratio. I know nothing about penguins (or birds, for that matter), so I don’t know if biologists even care about the depth/length ratio in bills, but it makes a nice proportion so we’ll go with it.
Here’s what the relationship looks like—as flipper length increases, the bill ratio decreases. Longer-flippered penguins have shallower, longer bills; shorter-flippered penguins have taller bills in proportion to their lengths. Or something like that.
ggplot(penguins, aes(x = flipper_length_mm, y = bill_ratio)) +
geom_point(size = 1, alpha = 0.7) +
geom_smooth(method = "lm", color = clrs[5], se = FALSE) +
labs(x = "Flipper length (mm)", y = "Ratio of bill depth / bill length") +
theme_pred()
We want to model that green line, and in this case it appears nice and straight and could probably be modeled with regular Gaussian regression, but we also want to make sure any predictions are constrained between 0 and 1 since we’re working with a proportion. Beta regression is perfect for this. Once again, I won’t go into detail about how beta models work—I have a whole detailed guide to it here.
With beta regression, we need to model two parameters of the beta distribution—the mean µ and the precision φ. Ordinarily, beta distributions are actually defined by two other parameters, called either shape 1 and shape 2, or α and β. The two systems of parameters are closely related, and you can switch between them with a little algebra—see this guide for an example of how.
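For reference, that little bit of algebra is: shape 1 = µφ and shape 2 = (1 − µ)φ, and in reverse, µ = shape1 / (shape1 + shape2) and φ = shape1 + shape2. Here’s a quick Python sketch (the helper names are made up):

```python
def muphi_to_shapes(mu, phi):
    """Mean/precision parameterization -> classic shape1/shape2."""
    return mu * phi, (1 - mu) * phi

def shapes_to_muphi(shape1, shape2):
    """Classic shape1/shape2 -> mean/precision parameterization."""
    return shape1 / (shape1 + shape2), shape1 + shape2

# A beta distribution with mean 0.4 and precision 100...
a, b = muphi_to_shapes(0.4, 100)
print(a, b)  # roughly 40 and 60

# ...converts back to the same mean and precision
print(shapes_to_muphi(a, b))
```

The round trip is exact, which is why it doesn’t matter which parameterization a particular software package uses under the hood.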
We can create a formal model for the distribution of the ratio of bill depth to bill length with a beta distribution with a mean µ and precision φ, each of which is conditional on different values of flipper length. The models for µ and φ don’t have to use the same explanatory variables—I’m just doing that here for the sake of simplicity.
Or more generally,
Let’s fit the model! But first, we’ll actually set more specific priors this time instead of relying on the defaults. Since µ is on the logit scale, its coefficients are unlikely to ever be huge (i.e., anything beyond ±4; recall the probability scale/logit scale plot earlier). The default brms priors for coefficients in beta regression models are flat and uniform, resulting in some potentially huge and implausible priors that lead to really bad model fit (and really slow sampling!). So we’ll help Stan a little here and explicitly tell it that the coefficients will be small (normal(0, 1)) and that φ must be positive (exponential(1) with a lower bound of 0).
Again, we don’t care about the coefficients or marginal effects here—see this guide for more about how to work with those. Let’s instead extract these different posterior distributions of bill ratios with the three main brms functions: posterior_linpred()
, posterior_epred()
, and posterior_predict()
. And once again, we’ll use a single value of flipper length (the average, 200.967 mm) to explore these distributions.
# Make a little dataset of just the average flipper length
penguins_avg_flipper <- penguins |>
  summarize(flipper_length_mm = mean(flipper_length_mm))

# Extract different types of posteriors
beta_linpred <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper)

beta_linpred_phi <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, dpar = "phi")

beta_linpred_trans <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, transform = TRUE)

beta_linpred_phi_trans <- model_beta |>
  linpred_draws(newdata = penguins_avg_flipper, dpar = "phi", transform = TRUE)

beta_epred <- model_beta |>
  epred_draws(newdata = penguins_avg_flipper)

beta_predicted <- model_beta |>
  predicted_draws(newdata = penguins_avg_flipper)
Notice the addition of two new posteriors here: linpred_draws(..., dpar = "phi") and linpred_draws(..., dpar = "phi", transform = TRUE). These give us the posterior distributions of the precision (φ) distributional parameter, measured on different scales.
Importantly, for weird historical reasons, it is possible to use posterior_epred(..., dpar = "phi") to get the unlogged φ parameter. However, conceptually this is wrong. An epred is the expected value, or average, of the posterior predictive distribution, or E(y). It is not the expected value of the φ part of the model. brms (or tidybayes) happily spits out the unlogged posterior distribution of φ when you use posterior_epred(..., dpar = "phi"), but it’s technically not an epred despite its name. To keep the terminology consistent, it’s best to use posterior_linpred() when working with distributional parameters, using either transform = FALSE or transform = TRUE for the logged or the unlogged scale.
summary_beta_linpred <- beta_linpred |>
  ungroup() |>
  summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_linpred_phi <- beta_linpred_phi |>
  ungroup() |>
  summarize(across(phi, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_linpred_phi_trans <- beta_linpred_phi_trans |>
  ungroup() |>
  summarize(across(phi, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_epred <- beta_epred |>
  ungroup() |>
  summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_beta_predicted <- beta_predicted |>
  ungroup() |>
  summarize(across(.prediction, lst(mean, sd, median), .names = "{.fn}"))

tribble(
  ~Function, ~`Model element`, ~Values,
  "<code>posterior_linpred()</code>", "\\(\\operatorname{logit}(\\mu)\\) in the model", "Logits or log odds",
  "<code>posterior_linpred(transform = TRUE)</code> or <code>posterior_epred()</code>", "\\(\\operatorname{E(y)}\\) and \\(\\mu\\) in the model", "Probabilities",
  '<code>posterior_linpred(dpar = "phi")</code>', "\\(\\log(\\phi)\\) in the model", "Logged precision values",
  '<code>posterior_linpred(dpar = "phi", transform = TRUE)</code>', "\\(\\phi\\) in the model", "Unlogged precision values",
  "<code>posterior_predict()</code>", "Random draws from posterior \\(\\operatorname{Beta}(\\mu, \\phi)\\)", "Values between 0–1"
) |>
  bind_cols(bind_rows(summary_beta_linpred, summary_beta_epred,
                      summary_beta_linpred_phi, summary_beta_linpred_phi_trans,
                      summary_beta_predicted)) |>
  kbl(escape = FALSE) |>
  kable_styling()
| Function | Model element | Values | mean | sd | median |
|---|---|---|---|---|---|
| `posterior_linpred()` | \(\operatorname{logit}(\mu)\) in the model | Logits or log odds | −0.423 | 0.011 | −0.423 |
| `posterior_linpred(transform = TRUE)` or `posterior_epred()` | \(\operatorname{E(y)}\) and \(\mu\) in the model | Probabilities | 0.396 | 0.003 | 0.396 |
| `posterior_linpred(dpar = "phi")` | \(\log(\phi)\) in the model | Logged precision values | 4.672 | 0.078 | 4.675 |
| `posterior_linpred(dpar = "phi", transform = TRUE)` | \(\phi\) in the model | Unlogged precision values | 107.284 | 8.329 | 107.259 |
| `posterior_predict()` | Random draws from posterior \(\operatorname{Beta}(\mu, \phi)\) | Values between 0–1 | 0.397 | 0.048 | 0.397 |
Neat! We have a bunch of different pieces here, all measured differently. Let’s look at all these different pieces simultaneously:
p1 <- ggplot(beta_linpred, aes(x = .linpred)) +
  stat_halfeye(fill = clrs[3]) +
  labs(x = "Logit-scale ratio of bill depth / bill length", y = NULL,
       title = "**Linear predictor** <span style='font-size: 14px;'>logit(*µ*) in the model</span>",
       subtitle = "posterior_linpred(\n  ..., tibble(flipper_length_mm = 201))\n") +
  theme_pred_dist()

p1a <- ggplot(beta_linpred_phi, aes(x = phi)) +
  stat_halfeye(fill = colorspace::lighten(clrs[3], 0.3)) +
  labs(x = "Log-scale precision parameter", y = NULL,
       title = "**Precision parameter** <span style='font-size: 14px;'>log(*φ*) in the model</span>",
       subtitle = 'posterior_linpred(\n  ..., tibble(flipper_length_mm = 201),\n  dpar = "phi")') +
  theme_pred_dist()

p2 <- ggplot(beta_epred, aes(x = .epred)) +
  stat_halfeye(fill = clrs[2]) +
  labs(x = "Ratio of bill depth / bill length", y = NULL,
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] or *µ* in the model</span>",
       subtitle = "posterior_epred(\n  ..., tibble(flipper_length_mm = 201))  # or \nposterior_linpred(..., transform = TRUE)") +
  theme_pred_dist()

p2a <- ggplot(beta_linpred_phi_trans, aes(x = phi)) +
  stat_halfeye(fill = colorspace::lighten(clrs[2], 0.4)) +
  labs(x = "Precision parameter", y = NULL,
       title = "**Precision parameter** <span style='font-size: 14px;'>*φ* in the model</span>",
       subtitle = 'posterior_linpred(\n  ..., tibble(flipper_length_mm = 201),\n  dpar = "phi", transform = TRUE)\n') +
  theme_pred_dist()

p3 <- ggplot(beta_predicted, aes(x = .prediction)) +
  stat_halfeye(fill = clrs[1]) +
  coord_cartesian(xlim = c(0.2, 0.6)) +
  labs(x = "Ratio of bill depth / bill length", y = NULL,
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Beta(*µ*, *φ*)</span>",
       subtitle = "posterior_predict()") +
  theme_pred_dist()

layout <- "
AB
CC
DE
FF
GG
"

p1 + p1a + plot_spacer() + p2 + p2a + plot_spacer() + p3 +
  plot_layout(design = layout, heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
As with logistic regression, the results from posterior_epred() and posterior_linpred() are not identical. They still both correspond to the \(\mu\) part of the model, but on different scales. posterior_epred() provides results on the probability or proportion scale, un-logiting and back-transforming the logit-scale results from posterior_linpred().
And once again, posterior_epred() isn't technically the back-transformed linear predictor (if you want that, you can use posterior_linpred(..., transform = TRUE)). Instead it shows the expected values of the posterior, or \(\operatorname{E}(y)\), or the average of the posterior's averages. But just like Gaussian regression and logistic regression, this average-of-averages still happens to be the same as the back-transformed \(\mu\), so \(\operatorname{E}(y) = \mu\).
We can extract the \(\phi\) parameter by including the dpar = "phi" argument (or technically just dpar = TRUE, which returns all possible distributional parameters, which is helpful in cases with lots of them, like zero-one-inflated beta regression). posterior_linpred(..., dpar = "phi", transform = TRUE) provides \(\phi\) on the original precision scale (however that's measured), while posterior_linpred(..., dpar = "phi") returns a log-transformed version.
And finally, the results from posterior_predict() are draws from a random beta distribution using the estimated \(\mu\) and \(\phi\), and they consist of values ranging between 0 and 1.
Showing the posterior predictions for these different parameters across a range of flipper lengths will help with the intuition and illustrate the different scales, values, and parameters that these posterior functions return:
- posterior_linpred() returns the value of \(\mu\) on the logit scale
- posterior_epred() returns the value of \(\mu\) on the probability scale (technically it's returning \(\operatorname{E}(y)\), but in practice those are identical here)
- posterior_linpred(..., dpar = "phi") returns the logged value of \(\phi\)
- posterior_linpred(..., dpar = "phi", transform = TRUE) returns the value of \(\phi\) on its original scale
- posterior_predict() returns probabilities or proportions

p1 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_linpred_draws(model_beta, ndraws = 100) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_point(data = penguins, aes(y = qlogis(bill_ratio)), size = 1, alpha = 0.7) +
  stat_lineribbon(aes(y = .linpred), .width = 0.95,
                  alpha = 0.5, color = clrs[3], fill = clrs[3]) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Logit-scale ratio of\nbill depth / bill length",
       title = "**Linear predictor posterior** <span style='font-size: 14px;'>logit(*µ*) in the model</span>",
       subtitle = "posterior_linpred()") +
  theme_pred_range()
p1a <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_linpred_draws(model_beta, ndraws = 100, dpar = "phi") |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = phi), .width = 0.95, alpha = 0.5,
                  color = colorspace::lighten(clrs[3], 0.3), fill = colorspace::lighten(clrs[3], 0.3)) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Log-scale\nprecision parameter",
       title = "**Precision parameter** <span style='font-size: 14px;'>log(*φ*) in the model</span>",
       subtitle = 'posterior_linpred(dpar = "phi")') +
  theme_pred_range()
p2 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_epred_draws(model_beta, ndraws = 100) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_point(data = penguins, aes(y = bill_ratio), size = 1, alpha = 0.7) +
  stat_lineribbon(aes(y = .epred), .width = 0.95,
                  alpha = 0.5, color = clrs[2], fill = clrs[2]) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Ratio of\nbill depth / bill length",
       title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] or *µ* in the model</span>",
       subtitle = 'posterior_epred()\nposterior_linpred(transform = TRUE)') +
  theme_pred_range()
p2a <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_epred_draws(model_beta, ndraws = 100, dpar = "phi") |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_lineribbon(aes(y = phi), .width = 0.95, alpha = 0.5,
                  color = colorspace::lighten(clrs[2], 0.4), fill = colorspace::lighten(clrs[2], 0.4)) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Precision parameter",
       title = "**Precision parameter** <span style='font-size: 14px;'>*φ* in the model</span>",
       subtitle = 'posterior_linpred(dpar = "phi",\n transform = TRUE)') +
  theme_pred_range()
p3 <- penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_predicted_draws(model_beta, ndraws = 500) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_point(data = penguins, aes(y = bill_ratio), size = 1, alpha = 0.7) +
  stat_lineribbon(aes(y = .prediction), .width = 0.95,
                  alpha = 0.5, color = clrs[1], fill = clrs[1]) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Ratio of\nbill depth / bill length",
       title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Beta(*µ*, *φ*)</span>",
       subtitle = "posterior_predict()") +
  theme_pred_range()
layout <- "
AB
CC
DE
FF
GG
"
p1 + p1a + plot_spacer() + p2 + p2a + plot_spacer() + p3 +
  plot_layout(design = layout, heights = c(0.3, 0.05, 0.3, 0.05, 0.3))
So many moving parts in these distributional models! This diagram summarizes all these different posteriors, scales, and distributional parameters:
Before finishing with beta regression, we can play around with some of these posterior parameters to better understand what this kind of distributional model is actually doing. First, we can plot the posterior distribution using the means of the posterior \(\mu\) and \(\phi\) parameters instead of using the results from posterior_predict(), creating a pseudo-analytical posterior distribution. We'll use the dprop() function from the extraDistr package instead of dbeta(), since dprop() uses \(\mu\) and \(\phi\) instead of shape 1 and shape 2.
It’s not the greatest model at all—the actual distribution of bill ratios is bimodal (probably because of species-specific differences)—but using the posterior values for \(\mu\) and \(\phi\) creates a distribution that picks up the average ratio.
In practice we typically don’t actually want to use these two parameters like this—we can use the results from posterior_predict()
instead—but it’s cool that we can produce the same distribution with these parameters. That’s the magic of these distributional models!
mu <- summary_beta_epred$mean
phi <- summary_beta_linpred_phi_trans$mean
ggplot(penguins, aes(x = bill_ratio)) +
geom_density(aes(fill = "Actual data"), color = NA) +
stat_function(
aes(fill = glue::glue("Beta(µ = {round(mu, 3)}, φ = {round(phi, 2)})")),
geom = "area", fun = ~ extraDistr::dprop(., mean = mu, size = phi),
alpha = 0.7
) +
scale_fill_manual(values = c(clrs[5], clrs[1]), name = NULL) +
xlim(c(0.2, 0.65)) +
labs(x = "Ratio of bill depth / bill length", y = NULL,
       title = "**Analytical posterior predictions** <span style='font-size: 14px;'>Average posterior *µ* and *φ* from the model</span>") +
theme_pred_dist() +
theme(legend.position = c(0, 0.9),
legend.justification = "left",
legend.key.size = unit(0.75, "lines"))
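Under the hood, the \(\mu\)/\(\phi\) parameterization that dprop() uses is just a relabeled beta distribution, with shape1 = \(\mu\phi\) and shape2 = \((1 - \mu)\phi\). As a quick base-R sanity check (using made-up values of \(\mu\) and \(\phi\), not the fitted posterior means), a density built this way has a mean of exactly \(\mu\):

```r
# Beta density parameterized by a mean (mu) and precision (phi), mirroring the
# mean/size parameterization of extraDistr::dprop():
#   shape1 = mu * phi, shape2 = (1 - mu) * phi
dbeta_muphi <- function(x, mu, phi) {
  dbeta(x, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

mu <- 0.4    # illustrative values only, not the model's posterior means
phi <- 100

# The mean of this distribution should come out as mu itself
dist_mean <- integrate(function(x) x * dbeta_muphi(x, mu, phi),
                       lower = 0, upper = 1)$value
dist_mean
```

Larger \(\phi\) values concentrate the density more tightly around \(\mu\), which is why \(\phi\) is called a precision parameter.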
For even more fun, because we modeled the \(\phi\) parameter as conditional on flipper length, it changes depending on different flipper lengths. This means that the actual posterior beta distribution is shaped differently across a whole range of lengths. Here’s what that looks like, with analytical distributions plotted at 180, 200, and 220 mm. As the precision increases, the distributions become narrower and more precise (which is also reflected in the size of the posterior_predict()-based credible intervals around the points).
muphi_to_shapes <- function(mu, phi) {
  shape1 <- mu * phi
  shape2 <- (1 - mu) * phi
  return(lst(shape1 = shape1, shape2 = shape2))
}
beta_posteriors <- tibble(flipper_length_mm = c(180, 200, 220)) |>
  add_linpred_draws(model_beta, ndraws = 500, dpar = TRUE, transform = TRUE) |>
  group_by(flipper_length_mm) |>
  summarize(across(c(mu, phi), ~mean(.))) |>
  ungroup() |>
  mutate(shapes = map2(mu, phi, ~as_tibble(muphi_to_shapes(.x, .y)))) |>
  unnest(shapes) |>
  mutate(nice_label = glue::glue("Beta(µ = {round(mu, 3)}, φ = {round(phi, 2)})"))
# Here are the parameters we'll use
# We need to convert the mu and phi values to shape1 and shape2 so that we can
# use dist_beta() to plot the halfeye distributions correctly
beta_posteriors
## # A tibble: 3 × 6
##   flipper_length_mm    mu   phi shape1 shape2 nice_label
##               <dbl> <dbl> <dbl>  <dbl>  <dbl> <glue>
## 1               180 0.485  58.1   28.2   29.9 Beta(µ = 0.485, φ = 58.1)
## 2               200 0.400  104.   41.7   62.6 Beta(µ = 0.4, φ = 104.26)
## 3               220 0.320  190.   61.0  129.  Beta(µ = 0.32, φ = 190.3)
penguins |>
  data_grid(flipper_length_mm = seq_range(flipper_length_mm, n = 100)) |>
  add_predicted_draws(model_beta, ndraws = 500) |>
  ggplot(aes(x = flipper_length_mm)) +
  geom_point(data = penguins, aes(y = bill_ratio), size = 1, alpha = 0.7) +
  stat_halfeye(data = beta_posteriors, aes(ydist = dist_beta(shape1, shape2), y = NULL),
               side = "bottom", fill = clrs[1], alpha = 0.75) +
  stat_lineribbon(aes(y = .prediction), .width = 0.95,
                  alpha = 0.1, color = clrs[1], fill = clrs[1]) +
  geom_text(data = beta_posteriors,
            aes(x = flipper_length_mm, y = 0.9, label = nice_label),
            hjust = 0.5) +
  coord_cartesian(xlim = c(170, 230)) +
  labs(x = "Flipper length (mm)", y = "Ratio of\nbill depth / bill length",
       title = "**Analytical posterior predictions** <span style='font-size: 14px;'>Average posterior *µ* and *φ* from the model</span>") +
  theme_pred_range()
## Warning: Unknown or uninitialised column: `linewidth`.
posterior_epred() isn’t just the back-transformed linear predictor
In all the examples in this guide, the results from posterior_epred() have been identical to the back-transformed results from posterior_linpred() (or posterior_linpred(..., transform = TRUE) if there are link functions). With logistic regression, posterior_epred() returned the probability-scale values of \(\mu\); with beta regression, posterior_epred() returned the proportion/probability-scale values of \(\mu\). This is the case for many model families in Stan and brms—for mathy reasons that go beyond my skills, the average of averages is the same as the back-transformed linear predictor for lots of distributions.
This isn’t always the case though! In some families, like lognormal models, posterior_epred() and posterior_linpred(..., transform = TRUE) give different estimates. For lognormal models, \(\operatorname{E}(y)\) isn’t just one of the distribution’s parameters—it’s this:

\[
\operatorname{E}(y) = \exp\left(\mu + \frac{\sigma^2}{2}\right)
\]
I won’t show any examples of that here—this guide is already too long—but Matthew Kay has an example here that shows the differences between expected posterior values and back-transformed linear posterior values.
To see which kinds of families use fancier epreds, look at the source for brms::posterior_epred() here. Most of the families just use the back-transformed mu (prep$dpars$mu in the code), but some have special values, like lognormal’s with(prep$dpars, exp(mu + sigma^2 / 2)).
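A quick simulation makes the lognormal case concrete. With arbitrary illustrative values of \(\mu\) and \(\sigma\) (unrelated to any model above), the mean of lognormal draws lines up with \(\exp(\mu + \sigma^2/2)\), while the naive back-transformation \(\exp(\mu)\) undershoots it:

```r
set.seed(1234)
mu <- 0.5      # arbitrary illustrative parameters
sigma <- 1

# A million draws from LogNormal(mu, sigma)
draws <- rlnorm(1e6, meanlog = mu, sdlog = sigma)

mean(draws)    # close to exp(mu + sigma^2 / 2) = exp(1), about 2.72
exp(mu)        # the naive back-transformation, about 1.65, is too small
```

That gap between the two numbers is exactly the difference between posterior_epred() and posterior_linpred(..., transform = TRUE) in lognormal families.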
Keeping track of which kinds of posterior predictions you’re working with, on which scales, and for which parameters, can be tricky, especially with more complex models with lots of moving parts. To make life easier, here are all the summary diagrams in one place:
(Download a PDF) or (download original Adobe Illustrator file)
(Download a PDF) or (download original Adobe Illustrator file)
(Download a PDF) or (download original Adobe Illustrator file)
And here’s an even more detailed summary cheat sheet as a printable PDF:
(Download a PDF) or (download the original Adobe InDesign file)
group_by()
and summarize()
and shows some interesting trends.
The data includes a column for CATEGORY
, showing the type of construction project that was allowed. It poses an interesting (and common!) visualization challenge: some of the category names are really long, and if you plot CATEGORY
on the x-axis, the labels overlap and become unreadable, like this:
library(tidyverse) # dplyr, ggplot2, and friends
library(scales) # Functions to format things nicely
# Load pandemic construction data
essential_raw <- read_csv("https://datavizs22.classes.andrewheiss.com/projects/04-exercise/data/EssentialConstruction.csv")
essential_by_category < essential_raw %>%
# Calculate the total number of projects within each category
group_by(CATEGORY) %>%
summarize(total = n()) %>%
# Sort by total
arrange(desc(total)) %>%
# Make the category column ordered
mutate(CATEGORY = fct_inorder(CATEGORY))
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects")
Ew. The middle categories here get all blended together into an unreadable mess.
Fortunately there are a bunch of different ways to fix this, each with their own advantages and disadvantages!
One quick and easy way to fix this is to change the dimensions of the plot so that there’s more space along the x-axis. If you’re using R Markdown or Quarto, you can modify the chunk options and specify fig.width
:
```{r name-of-chunk, fig.width=10, fig.height=4}
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects")
```
If you’re using ggsave()
, you can specify the height and width there too:
ggsave(name_of_plot, width = 10, height = 4, units = "in")
That works, but now the font is tiny, so we need to adjust it up with theme_gray(base_size = 18)
:
```{r name-of-chunk, fig.width=10, fig.height=4}
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects") +
theme_gray(base_size = 18)
```
Now the font is bigger, but the labels overlap again! We could make the figure wider again, but then we’d need to increase the font size again, and now we’re in an endless loop.
Verdict: 2/10, easy to do, but more of a quick band-aid-style solution; not super recommended.
Another quick and easy solution is to switch the x- and y-axes. If we put the categories on the y-axis, each label will be on its own line so the labels can’t overlap with each other anymore:
ggplot(essential_by_category,
aes(y = fct_rev(CATEGORY), x = total)) +
geom_col() +
scale_x_continuous(labels = comma) +
labs(y = NULL, x = "Total projects")
That works really well! However, it forces you to work with horizontal bars. If that doesn’t fit with your overall design (e.g., if you really want vertical bars), this won’t work. Additionally, if you have any really long labels, it can substantially shrink the plot area, like this:
# Make one of the labels super long for fun
essential_by_category %>%
mutate(CATEGORY = recode(CATEGORY, "Schools" = "Preschools, elementary schools, middle schools, high schools, and other schools")) %>%
ggplot(aes(y = fct_rev(CATEGORY), x = total)) +
geom_col() +
scale_x_continuous(labels = comma) +
labs(y = NULL, x = "Total projects")
Verdict: 6/10, easy to do and works well if you’re happy with horizontal bars; can break if labels are too long (though long y-axis labels are fixable with the other techniques in this post too).
Instead of messing with the width of the plot, we can mess with the category names themselves. We can use recode()
from dplyr to recode some of the longer category names or add line breaks (\n
) to them:
essential_by_category_shorter <- essential_by_category %>%
mutate(CATEGORY = recode(CATEGORY,
"Affordable Housing" = "Aff. Hous.",
"Hospital / Health Care" = "Hosp./Health",
"Public Housing" = "Pub. Hous.",
"Homeless Shelter" = "Homeless\nShelter"))
ggplot(essential_by_category_shorter,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects")
That works great! However, it reduces readability (does “Aff. Hous.” mean affordable housing? affluent housing? affable housing?). It also requires more manual work and a lot of extra typing. If a new longer category gets added in a later iteration of the data, this code won’t automatically shorten it.
Verdict: 6/10, we have more control over the labels, but too much abbreviation reduces readability, and it’s not automatic.
Since we want to avoid manually recoding categories, we can do some visual tricks to make the labels readable without changing any of the label text. First we can rotate the labels a little. Here we rotate the labels 30°, but we could also do 45°, 90°, or whatever we want. If we add hjust = 0.5
(horizontal justification), the rotated labels will be centered in the columns, and vjust
(vertical justification) will center the labels vertically.
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects") +
theme(axis.text.x = element_text(angle = 30, hjust = 0.5, vjust = 0.5))
Everything fits great now, but I’m not a big fan of angled text. I’m also not happy with all the empty vertical space between the axis and the shorter labels like “Schools” and “Utility”. It would look a lot nicer to have all these labels right-aligned to the axis, but there’s no easy way to do that.
Verdict: 5.5/10, no manual work needed, but angled text is harder to read and there’s lots of extra uneven whitespace.
Second, instead of rotating, as of ggplot2 v3.3.0 we can automatically dodge the labels and make them offset across multiple rows with the guide_axis(n.dodge = N)
function in scale_x_*()
:
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects")
That’s pretty neat. Again, this is all automatic and we don’t have to manually adjust any labels. The text is all horizontal so it’s more readable. But I’m not a huge fan of the gaps above the second-row labels. Maybe it would look better if the corresponding axis ticks were a little longer, idk.
Verdict: 7/10, no manual work needed, labels easy to read, but there’s extra whitespace that can sometimes feel unbalanced.
The easiest, quickest, and nicest way to fix these long labels, though, is to use the label_wrap()
function from the scales package. This will automatically add line breaks after X characters in labels with lots of text—you just have to tell it how many characters to use. The function is smart enough to break after word boundaries—that is, if you tell it to break after 5 characters, it won’t split something like “Approved” into “Appro” and “ved”; it’ll break after the end of the word.
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_x_discrete(labels = label_wrap(10)) +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects")
Look at how the x-axis labels automatically break across lines! That’s so neat!
Verdict: 11/10, no manual work needed, labels easy to read, everything’s perfect. This is the way.
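The word-boundary wrapping behaves essentially like base R’s strwrap() (which label_wrap() builds on), so it’s easy to preview outside of a plot. A small self-contained illustration:

```r
# Wrap labels at (roughly) 10 characters, breaking only at word boundaries,
# mimicking what scales::label_wrap(10) does to axis labels
wrap10 <- function(x) {
  vapply(strwrap(x, width = 10, simplify = FALSE),
         paste, character(1), collapse = "\n")
}

wrap10("Affordable Housing")   # "Affordable\nHousing"
wrap10("Approved Work")        # "Approved\nWork" (words stay whole)
wrap10("Schools")              # "Schools" (short labels untouched)
```

Words longer than the width are kept intact rather than split, which is exactly why “Approved” never becomes “Appro” and “ved”.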
Bonus: For things that aren’t axis labels, like titles and subtitles, you can use str_wrap()
from stringr to break long text at X characters (specified with width
):
ggplot(essential_by_category,
aes(x = CATEGORY, y = total)) +
geom_col() +
scale_x_discrete(labels = label_wrap(10)) +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "Total projects",
title = str_wrap(
"Here's a really long title that will go off the edge of the figure unless it gets broken somewhere",
width = 50),
subtitle = str_wrap(
"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
width = 70))
Here’s a quick comparison of all these different approaches:
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Monterey 12.6
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-11-29
## pandoc 2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
## quarto 1.3.26 @ /usr/local/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ! package * version date (UTC) lib source
## P dplyr * 1.0.10 2022-09-01 [?] CRAN (R 4.2.0)
## P forcats * 0.5.2 2022-08-19 [?] CRAN (R 4.2.0)
## P ggplot2 * 3.4.0 2022-11-04 [?] CRAN (R 4.2.0)
## P patchwork * 1.1.2 2022-08-19 [?] CRAN (R 4.2.0)
## P purrr * 0.3.5 2022-10-06 [?] CRAN (R 4.2.0)
## P readr * 2.1.3 2022-10-01 [?] CRAN (R 4.2.0)
## P scales * 1.2.1 2022-08-20 [?] CRAN (R 4.2.0)
## P sessioninfo * 1.2.2 2021-12-06 [?] CRAN (R 4.2.0)
## P stringr * 1.4.1 2022-08-20 [?] CRAN (R 4.2.0)
## P tibble * 3.1.8 2022-07-22 [?] CRAN (R 4.2.0)
## P tidyr * 1.2.1 2022-09-08 [?] CRAN (R 4.2.0)
## P tidyverse * 1.3.2 2022-07-18 [?] CRAN (R 4.2.0)
##
## [1] /Users/andrew/Sites/ath-quarto/renv/library/R-4.2/aarch64-apple-darwin20
## [2] /Users/andrew/Sites/ath-quarto/renv/sandbox/R-4.2/aarch64-apple-darwin20/84ba8b13
##
## P ── Loaded and on-disk path mismatch.
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
I’m a huge fan of doing research and analysis in public. I try to make my research public and freely accessible, but ever since watching David Robinson’s “The unreasonable effectiveness of public work” keynote from rstudio::conf 2019, I’ve tried to make my research process open and accessible too.
According to David, researchers typically view their work like this:
People work towards a final published product, which is the most valuable output of the whole process. The intermediate steps like the code, data, preliminary results, and so on, are less valuable and often hidden from the public. People only see the final published thing.
David argues that we should instead see our work like this:
In this paradigm, anything on your computer and only accessible by you isn’t that valuable. Anything you make accessible to the public online—including all the intermediate stuff like code, data, and preliminary results, in addition to the final product—is incredibly valuable. The world can benefit from neat code tricks you stumble on while making graphs; the world can benefit from new data sources you find or your way of processing data; the world can benefit from a toy example of a new method you read about in some paper, even if the actual code you write to play around with the method never makes it into any published paper. It’s all useful to the broader community of researchers.
Public work also builds community norms—if more people share their behindthescenes work, it encourages others to do the same and engage with it and improve it (see this super detailed and helpful comment with corrections to my previous post, for example!).
Public work is also valuable for another more selfish reason. Building an online presence with a wide readership is hard, and my little blog post contributions aren’t famous or anything—they’re just sitting out here in a tiny corner of the internet. But these guides have been indispensable for me. They’ve allowed me to work through and understand tricky statistical and programming concepts, and then have allowed me to come back to them months later and remember how they work. This whole blog is primarily a resource for future me.
So here’s yet another blog post that is hopefully potentially useful for the general public, but that is definitely useful for future me.
In a few of my ongoing research projects, I’m working with nonlinear regression models, and I’ve been struggling to interpret their results. In my past few posts (like this one on hurdle models, or this one on multilevel panel data, or this one on beta and zero-inflated models), I’ve explored a bunch of different ways to work with and interpret these more complex models and calculate their marginal effects. I even wrote a guide to calculating average marginal effects for multilevel models. TURNS OUT™, though, that I’ve actually been a bit wrong about my terminology for all the marginal effects I’ve talked about in those posts.
Part of the reason for this wrongness is that there are so many quasi-synonyms for the idea of “marginal effects,” and people seem to be pretty loosey-goosey about what exactly they’re referring to. There are statistical effects, marginal effects, marginal means, marginal slopes, conditional effects, conditional marginal effects, marginal effects at the mean, and many other similarly named ideas. There are also regression coefficients and estimates, which have marginal effects vibes, but may or may not actually be marginal effects depending on the complexity of the model.
The question of what the heck “marginal effects” are has plagued me for a while. In October 2021 I publicly announced that I would finally buckle down and figure out their definitions and nuances:
And then I didn’t.
So here I am, 7 months later, publicly figuring out the differences between regression coefficients, regression predictions, marginaleffects, emmeans, marginal slopes, average marginal effects, marginal effects at the mean, and all these other “marginal” things that researchers and data scientists use.
This guide is highly didactic and slowly builds up the concept of marginal effects as slopes and partial derivatives. The tl;dr section at the end has a useful summary of everything here, with a table showing all the different approaches to marginal effects with corresponding marginaleffects and emmeans code, as well as some diagrams outlining the two packages’ different approaches to averaging. Hopefully it’s useful—it is for me!
Let’s get started by looking at some lines and slopes (after loading a bunch of packages and creating some useful little functions).
# Load packages
# ------------------------------------------------------------------------------
library(tidyverse) # dplyr, ggplot2, and friends
library(broom) # Convert models to data frames
library(marginaleffects) # Marginal effects stuff
library(emmeans) # Marginal effects stuff

# Visualization-related packages
library(ggtext) # Add markdown/HTML support to text in plots
library(glue) # Python-esque string interpolation
library(scales) # Functions to format numbers nicely
library(gganimate) # Make animated plots
library(patchwork) # Combine ggplots
library(ggrepel) # Make labels that don't overlap
library(MetBrewer) # Artsy color palettes

# Data-related packages
library(palmerpenguins) # Penguin data
library(WDI) # Get data from the World Bank's API
library(countrycode) # Map country codes to different systems
library(vdemdata) # Use data from the Varieties of Democracy (V-Dem) project
# Install vdemdata from GitHub, not CRAN
# devtools::install_github("vdeminstitute/vdemdata")

# Helpful functions
# ------------------------------------------------------------------------------
# Format numbers in pretty ways
nice_number <- label_number(style_negative = "minus", accuracy = 0.01)
nice_p <- label_pvalue(prefix = c("p < ", "p = ", "p > "))

# Point-slope formula: (y - y1) = m(x - x1)
find_intercept <- function(x1, y1, slope) {
  intercept <- slope * (-x1) + y1
  return(intercept)
}

# Visualization settings
# ------------------------------------------------------------------------------
# Custom ggplot theme to make pretty plots
# Get IBM Plex Sans Condensed at https://fonts.google.com/specimen/IBM+Plex+Sans+Condensed
theme_mfx <- function() {
  theme_minimal(base_family = "IBM Plex Sans Condensed") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          plot.title = element_text(face = "bold"),
          axis.title = element_text(face = "bold"),
          strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          legend.title = element_text(face = "bold"))
}

# Make labels use IBM Plex Sans by default
update_geom_defaults("label",
                     list(family = "IBM Plex Sans Condensed"))
update_geom_defaults(ggtext::GeomRichText,
                     list(family = "IBM Plex Sans Condensed"))
update_geom_defaults("label_repel",
                     list(family = "IBM Plex Sans Condensed"))

# Use the Johnson color palette
clrs <- met.brewer("Johnson")
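As a quick sanity check of the point-slope logic behind the find_intercept() helper above (re-derived inline here so the snippet stands alone): a line with slope 2 passing through the point (1, 1) has y-intercept -1.

```r
# Point-slope form: (y - y1) = m(x - x1)
# Setting x = 0 gives the y-intercept: b = y1 - m * x1
slope <- 2
x1 <- 1
y1 <- 1    # (1, 1) sits on the line y = 2x - 1

intercept <- y1 - slope * x1
intercept  # -1
```

That intercept matches the line plotted in the next section.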
Put as simply as possible, in the world of statistics, “marginal” means “additional,” or what happens to an outcome variable (\(y\)) when an explanatory variable (\(x\)) changes a little.
To find out precisely how much things change, we need to use calculus.
Oh no.
I haven’t taken a formal calculus class since my senior year of high school in 2002. I enjoyed it a ton and got the highest score on the AP Calculus BC test, which gave me enough college credits to not need it as an undergraduate, given that I majored in Middle East Studies, Arabic, and Italian. I figured I’d never need to think about calculus ever again. lol.
In my first PhDlevel stats class in 2012, the professor cancelled class for the first month and assigned us all to go relearn calculus with Khan Academy, since I wasn’t alone in my unlearning of calculus. Even after that crash course refresher, I don’t really ever use it in my own research. When I do, I only use it to think about derivatives and slopes, since those are central to statistics.
Calculus can be boiled down to two forms: (1) differential calculus is all about finding rates of changes by calculating derivatives, or slopes, while (2) integral calculus is all about finding total amounts, or areas, by adding infinitesimally small things together. According to the fundamental theorem of calculus, these two types are actually the inverse of each other—you can find the total area under a curve based on its slope, for instance. Super neat stuff. If you want a cool accessible refresher / history of all this, check out Steven Strogatz’s Infinite Powers: How Calculus Reveals the Secrets of the Universe—it’s great.
In the world of statistics and marginal effects all we care about are slopes, which are solely a differential calculus idea.
Let’s pretend we have a line that shows the relationship between \(x\) and \(y\) that’s defined with an equation using the form \(y = mx + b\), where \(m\) is the slope and \(b\) is the y-intercept. We can plot it with ggplot using the helpful geom_function()
function:
# y = 2x  1
a_line < function(x) (2 * x)  1
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_function(fun = a_line, size = 1, color = clrs[2]) +
scale_x_continuous(breaks = -2:5, limits = c(-1, 3)) +
scale_y_continuous(breaks = -3:9) +
annotate(geom = "segment", x = 1, y = 1.3, xend = 1, yend = 3, color = clrs[4], size = 0.5) +
annotate(geom = "segment", x = 1, y = 3, xend = 1.8, yend = 3, color = clrs[4], size = 0.5) +
annotate(geom = "richtext", x = 1.4, y = 3.1, label = "Slope: **2**", vjust = 0) +
labs(x = "x", y = "y") +
coord_equal() +
theme_mfx()
The line crosses the y-axis at −1, and its slope, or its $m$, is 2, or $\frac{2}{1}$, meaning that we go up two units and to the right one unit.
Importantly, the slope shows the relationship between $x$ and $y$. If $x$ increases by 1 unit, $y$ increases by 2: when $x$ is 1, $y$ is 1; when $x$ is 2, $y$ is 3, and so on. We can call this the marginal effect, or the change in $y$ that results from one additional $x$.
We can think about this slope using calculus language too. In differential calculus, slopes are called derivatives and they represent the change in $y$ that results from changes in $x$, or $\frac{dy}{dx}$. The $d$ here refers to an infinitesimal change in the values of $x$ and $y$, rather than a one-unit change like we think of when looking at the slope as $\frac{\Delta y}{\Delta x}$. Even more technically, the $d$ indicates that we’re working with the total derivative, since there’s only one variable ($x$) to consider. If we had more variables (like $z$), we would need to find the partial derivative for $x$, holding $z$ constant, and we’d write the derivative with a $\partial$ symbol instead: $\frac{\partial y}{\partial x}$. More on that in a bit.
By plotting this line, we can figure out $\frac{dy}{dx}$ visually—the slope is 2. But we can figure it out mathematically too. Differential calculus is full of fancy tricks and rules of thumb for figuring out derivatives, like the power rule, the chain rule, and so on. The easiest one for me to remember is the power rule, which says you can find the slope of a variable like $x^2$ by decreasing its exponent by 1 and multiplying that exponent by the variable’s coefficient (so $x^2$ becomes $2x$). All constants (terms without $x$) disappear.
(My secret is that I only know the power rule and so I avoid calculus at all costs and either use R or use Wolfram Alpha—go to Wolfram Alpha, type in derivative y = 2x - 1
and you’ll see some magic.)
We thus know that the derivative of $y = 2x - 1$ is 2 ($\frac{dy}{dx} = 2$). At every point on this line, the slope is 2—it never changes.
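We can sanity-check that claim numerically, without any calculus at all. Here’s a quick sketch (my own illustration, not part of the original code): nudge $x$ by a tiny amount and see how much $y$ changes.

```r
# Approximate the slope of y = 2x - 1 with a tiny finite difference
a_line <- function(x) (2 * x) - 1

h <- 0.001  # a tiny nudge to x
(a_line(2 + h) - a_line(2)) / h    # slope at x = 2: 2
(a_line(-1 + h) - a_line(-1)) / h  # slope at x = -1: also 2, since the line is straight
```

The same nudge-and-subtract idea shows up again later as the trick that marginal effects packages use to avoid doing symbolic calculus.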
slope_annotations <- tibble(x = c(0.25, 1.2, 2.4)) |>
  mutate(y = a_line(x)) |>
  mutate(nice_y = y + 1) |>
  mutate(nice_label = glue("x: {x}; y: {y}<br>",
                           "Slope (dy/dx): **{2}**"))
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_function(fun = a_line, size = 1, color = clrs[2]) +
geom_point(data = slope_annotations, aes(x = x, y = y)) +
geom_richtext(data = slope_annotations,
aes(x = x, y = y, label = nice_label),
nudge_y = 0.5) +
scale_x_continuous(breaks = -2:5, limits = c(-1, 3)) +
scale_y_continuous(breaks = -3:9) +
labs(x = "x", y = "y") +
coord_equal() +
theme_mfx()
The power rule seems super basic for equations with non-exponentiated $x$s, but it’s really helpful with more complex equations, like this parabola $y = -0.5x^2 + 5x + 5$:
# y = -0.5x^2 + 5x + 5
a_parabola <- function(x) (-0.5 * x^2) + (5 * x) + 5
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_function(fun = a_parabola, size = 1, color = clrs[2]) +
xlim(-5, 15) +
labs(x = "x", y = "y") +
coord_cartesian(ylim = c(-5, 20)) +
theme_mfx()
What’s interesting here is that there’s no longer a single slope for the whole function. The steepness of the slope across a range of $x$s depends on whatever $x$ currently is. The curve is steeper at really low and really high values of $x$ and it is shallower around 5 (and it is completely flat when $x$ is 5).
If we apply the power rule to the parabola formula we can find the exact slope:

$$\frac{dy}{dx} = -x + 5$$

When $x$ is 0, the slope is 5 ($-0 + 5$); when $x$ is 8, the slope is −3 ($-8 + 5$), and so on. We can visualize this if we draw some lines tangent to some different points on the equation. The slope of each of these tangent lines represents the instantaneous slope of the parabola at each $x$ value.
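If you’d rather not apply the power rule by hand (or leave R for Wolfram Alpha), base R’s D() function can do simple symbolic differentiation. A small sketch of that, not from the original code:

```r
# Symbolically differentiate the parabola y = -0.5x^2 + 5x + 5 with respect to x
dydx <- D(quote(-0.5 * x^2 + 5 * x + 5), "x")

# Evaluate the derivative at specific values of x
eval(dydx, list(x = 0))  # 5
eval(dydx, list(x = 8))  # -3
```

Those values match the tangent-line slopes plotted below.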
# dy/dx = -x + 5
parabola_slope <- function(x) (-x) + 5
slope_annotations <- tibble(
  x = c(0, 3, 8)
) |>
  mutate(y = a_parabola(x),
         slope = parabola_slope(x),
         intercept = find_intercept(x, y, slope),
         nice_slope = glue("Slope (dy/dx)<br><span style='font-size:12pt;color:{clrs[4]}'>**{slope}**</span>"))
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_function(fun = a_parabola, size = 1, color = clrs[2]) +
geom_abline(data = slope_annotations,
aes(slope = slope, intercept = intercept),
size = 0.5, color = clrs[4], linetype = "21") +
geom_point(data = slope_annotations, aes(x = x, y = y),
size = 3, color = clrs[4]) +
geom_richtext(data = slope_annotations, aes(x = x, y = y, label = nice_slope),
nudge_y = 2) +
xlim(-5, 15) +
labs(x = "x", y = "y") +
coord_cartesian(ylim = c(-5, 20)) +
theme_mfx()
And here’s an animation of what the slope looks like across a whole range of s. Neat!
In the calculus world, the term “marginal” isn’t used all that often. Instead they talk about derivatives. But in the end, all these marginal/derivative things are just slopes.
Before looking at how this applies to the world of statistics, let’s look at a quick example from economics, since economists also use the word “marginal” to refer to slopes. My first exposure to the word “marginal” meaning “changes in things” wasn’t actually in the world of statistics, but in economics. I took my first microeconomics class as a first-year MPA student in 2010 (and hated it; ironically I teach it now 🤷).
One common question in microeconomics relates to how people maximize their happiness, or utility, under budget constraints (see here for an R-based example). Economists imagine that people have utility functions in their heads that take inputs and convert them to utility (or happiness points). For instance, let’s pretend that the happiness/utility ($U$) you get from the number of cookies you eat ($x$) is defined like this:

$$U = -0.5x^2 + 5x$$

Here’s what that looks like:
# u = -0.5x^2 + 5x
u_cookies <- function(x) (-0.5 * x^2) + (5 * x)
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_function(fun = u_cookies, size = 1, color = clrs[2]) +
scale_x_continuous(breaks = seq(0, 12, 2), limits = c(0, 12)) +
labs(x = "Cookies", y = "Utility (happiness points)") +
theme_mfx()
This parabola represents your total utility from cookies. Eat 1 cookie, get 4.5 happiness points; eat 3 cookies, get 10.5 points; eat 6, get 12 points; and so on.
The marginal utility, on the other hand, tells you how much more happiness you’d get from eating one more cookie. If you’re currently eating 1, how many more happiness points would you get by moving to 2? And if you’re eating 7, what would happen to your happiness if you moved to 8? We can figure this out by looking at the slope of the parabola, which will show us the instantaneous rate of change, or marginal utility, for any number of cookies.
Power rule time! (or type derivative -0.5x^2 + 5x at Wolfram Alpha)

$$\frac{dU}{dx} = -x + 5$$
Let’s plot this really quick too:
# dU/dx = -x + 5
mu_cookies <- function(x) (-x) + 5
ggplot() +
geom_vline(xintercept = 0, size = 0.5, color = "grey50") +
geom_hline(yintercept = 0, size = 0.5, color = "grey50") +
geom_vline(xintercept = 5, size = 0.5,
linetype = "21", color = clrs[3]) +
geom_function(fun = mu_cookies, size = 1, color = clrs[5]) +
scale_x_continuous(breaks = seq(0, 12, 2), limits = c(0, 12)) +
labs(x = "Cookies", y = "Marginal utility (additional happiness points)") +
theme_mfx()
If you’re currently eating 1 cookie and you grab another one, you’ll gain 4 extra or marginal happiness points. If you’re eating 6 and you grab another one, you’ll actually lose some happiness—the marginal utility at 6 is −1. If you’re an economist who wants to maximize your happiness, you should eat the number of cookies where the extra happiness you’d get is 0, or where marginal utility is 0:

$$-x + 5 = 0 \qquad x = 5$$

Eat 5 cookies, maximize your happiness. Eat any more and you’ll start getting disutility (like a stomachache). This is apparent in the marginal utility plot too. All the values of marginal utility to the left of 5 are positive; all the values to the right of 5 are negative. Economists call this decreasing marginal utility.
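R can find that happiness-maximizing point numerically too. As a quick check (my own sketch, assuming the utility function defined above), base R’s optimize() searches for the maximum:

```r
# Total utility from cookies: U = -0.5x^2 + 5x
u_cookies <- function(x) (-0.5 * x^2) + (5 * x)

# Search 0-12 cookies for the utility-maximizing amount
best <- optimize(u_cookies, interval = c(0, 12), maximum = TRUE)
best$maximum    # ~5 cookies
best$objective  # ~12.5 happiness points
```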
This relationship between total utility and marginal utility is even more apparent if we look at both simultaneously (for fun I included the second derivative ($\frac{d^2U}{dx^2}$), or the slope of the first derivative, in the marginal utility panel):
Marginal utility, marginal revenue, marginal costs, and all those other marginal things are great for economists, but how does this “marginal” concept relate to statistics? Is it the same?
Yep! Basically!
At its core, regression modeling in statistics is all about fancy ways of finding averages and fancy ways of drawing lines. Even if you’re doing non-regression things like t-tests, those are technically still just regression behind the scenes.
Statistics is all about lines, and lines have slopes, or derivatives. These slopes represent the marginal changes in an outcome. As you move an independent/explanatory variable $x$, what happens to the dependent/outcome variable $y$?
Before getting into the mechanics of statistical marginal effects, it’s helpful to review what exactly regression coefficients are doing in statistical models, especially when dealing with both continuous and categorical explanatory variables.
When I teach statistics to my students, my favorite analogy for regression is to think of sliders and switches. Sliders represent continuous variables: as you move them up and down, something gradual happens to the resulting light. Switches represent categorical variables: as you turn them on and off, there are larger overall changes to the resulting light.
Let’s look at some super tiny quick models to illustrate this, using data from palmerpenguins:
penguins <- penguins |> drop_na()
model_slider <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
tidy(model_slider)
## # A tibble: 2 × 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)        -5872.     310.       -18.9 1.18e- 54
## 2 flipper_length_mm     50.2      1.54      32.6 3.13e-105
model_switch <- lm(body_mass_g ~ species, data = penguins)
tidy(model_switch)
## # A tibble: 3 × 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)        3706.       38.1    97.2   6.88e-245
## 2 speciesChinstrap     26.9      67.7     0.398 6.91e-  1
## 3 speciesGentoo      1386.       56.9    24.4   1.01e- 75
Disregard the intercept for now and just look at the coefficients for flipper_length_mm and species*. Flipper length is a continuous variable, so it’s a slider—as flipper length increases by 1 mm, penguin body mass increases by 50 grams. Slide it up more and you’ll see a bigger increase: if flipper length increases by 10 mm, body mass should increase by 500 grams. Slide it down for fun too! If flipper length decreases by 1 mm, body mass decreases by 50 grams. Imagine it like a sliding light switch.
Species, on the other hand, is a switch. There are three possible values here: Adelie, Chinstrap, and Gentoo. The base case in the results here is Adelie since it comes first alphabetically. The coefficients for speciesChinstrap
and speciesGentoo
aren’t sliders—you can’t talk about one-unit increases in Gentooness or Chinstrapness. Instead, the values show what happens in relation to the average weight of Adelie penguins if you flip the Chinstrap or Gentoo switch. Chinstrap penguins are 29 grams heavier than Adelie penguins on average, while the chonky Gentoo penguins are 1.4 kg heavier than Adelie penguins. With these categorical coefficients, we’re flipping a switch on and off: Adelie vs. Chinstrap and Adelie vs. Gentoo.
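A tiny sketch with made-up numbers (not the real penguins data) makes the switch logic concrete: with a single categorical predictor, the intercept is the base group’s mean and each switch coefficient is the difference between that group’s mean and the base group’s mean.

```r
# Hypothetical toy data: three weights per species
toy <- data.frame(
  species = rep(c("Adelie", "Gentoo"), each = 3),
  body_mass_g = c(3600, 3700, 3800, 5000, 5100, 5200)
)

model_toy <- lm(body_mass_g ~ species, data = toy)
coef(model_toy)
# (Intercept) is the Adelie mean (3700);
# speciesGentoo is the Gentoo mean minus the Adelie mean (5100 - 3700 = 1400)
```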
This slider and switch analogy holds when thinking about multiple regression too, though we need to think of lots of sliders and switches, like in an audio mixer board:
With a mixer board, we can move many different sliders up and down and use different combinations of switches, all of which ultimately influence the audio output.
Let’s make a more complex, mixer-board-esque regression model with multiple continuous (slider) and categorical (switch) explanatory variables:
model_mixer <- lm(body_mass_g ~ flipper_length_mm + bill_depth_mm + species + sex,
                  data = penguins)
tidy(model_mixer)
## # A tibble: 6 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -1212.     568.       -2.13 3.36e- 2
## 2 flipper_length_mm     17.5      2.87      6.12 2.66e- 9
## 3 bill_depth_mm         74.4     19.7       3.77 1.91e- 4
## 4 speciesChinstrap     -78.9     45.5      -1.73 8.38e- 2
## 5 speciesGentoo       1154.     119.        9.73 8.02e-20
## 6 sexmale              435.      44.8      9.72 8.79e-20
Interpreting these coefficients is a little different now, since we’re working with multiple moving parts. In regular stats class, you’ve probably learned to say something like “Holding all other variables constant, a 1 mm increase in flipper length is associated with a 17.5 gram increase in body mass, on average” (slider) or “Holding all other variables constant, Chinstrap penguins are 79 grams lighter than Adelie penguins, on average” (switch).
This idea of “holding everything constant” can be tricky to wrap your head around, though. Imagining this model like a mixer board can help. Pretend that you set the bill depth slider to some value (0, the average, whatever), you flip the Chinstrap and Gentoo switches off, you flip the male switch off, and then you slide only the flipper length slider up and down. You’d be looking at the marginal effect of flipper length for female Adelie penguins with an average (or 0 or whatever) bill depth. Stop moving the flipper length slider and start moving the bill depth slider and you’ll see the marginal effect of bill depth for female Adelie penguins. Flip on the male switch and you’ll see the marginal effect of bill depth for male Adelie penguins. Flip on the Gentoo switch and you’ll see the marginal effect of bill depth for male Gentoo penguins. And so on.
In calculus, if you have a model like model_slider with just one continuous variable, the slope or derivative of that variable is the total derivative, or $\frac{dy}{dx}$. If you have a model like model_mixer with lots of other variables, the slope or derivative of any of the individual explanatory variables is the partial derivative, or $\frac{\partial y}{\partial x}$, where all other variables are held constant.
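In equation form, that looks something like this (a generic sketch with placeholder variables, not the actual penguin model): when we take the partial derivative with respect to one variable, every other term is treated as a constant and drops out, leaving just that variable’s coefficient.

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \\
\frac{\partial y}{\partial x_1} &= \beta_1
\end{aligned}
```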
Oops. When talking about these penguin regression results up there ↑ I used the term “marginal effect,” but we haven’t officially defined it in the statistics world yet. It’s tricky to do that, though, because there are so many synonyms and near-synonyms for the idea of a statistical effect, like marginal effect, marginal mean, marginal slope, conditional effect, conditional marginal effect, and so on.
Formally defined, a marginal effect is a partial derivative from a regression equation. It’s the instantaneous slope of one of the explanatory variables in a model, with all the other variables held constant. If we continue with the mixing board analogy, it represents what would happen to the resulting audio levels if we set all sliders and switches to some stationary level and we moved just one slider up a tiny amount.
However, in practice, people use the term “marginal effect” to mean a lot more than just a partial derivative. For instance, in a randomized controlled trial, the difference in group means between the treatment and control groups is often called a marginal effect (and sometimes called a conditional effect, or even a conditional marginal effect). The term is also often used to talk about other group differences, like differences in penguin weights across species.
In my mind, all these quasi-synonymous terms represent the same idea of a statistical effect, or what would happen to an outcome if one of the explanatory variables (be it continuous, categorical, or whatever) were different. The more precise terms like marginal effect, conditional effect, marginal mean, and so on, are variations on this theme. This is similar to how a square is a rectangle, but a rectangle is not a square—they’re all super similar, but with minor subtle differences depending on the type of variables we’re working with:
Let’s look at true marginal effects, or the partial derivatives of continuous variables in a model (or sliders, in our slider/switch analogy). For the rest of this post, we’ll move away from penguins and instead look at some cross-national data about the relationship between public sector corruption, the legal requirement to disclose donations to political campaigns, and respect for human rights, since that’s all more related to what I do in my own research (I know nothing about penguins). We’ll explore two different political science/policy questions:
We’ll use data from the World Bank and from the Varieties of Democracy project and just look at one year of data (2020) so we don’t have to worry about panel data. There’s a great R package for accessing VDem data without needing to download it manually from their website, but it’s not on CRAN—it has to be installed from GitHub.
VDem and the World Bank have hundreds of different variables, but we only need a few, and we’ll make a few adjustments to the ones we do need. Here’s what we’ll do:
Main continuous outcome and continuous explanatory variable: Public sector corruption index (v2x_pubcorr
in VDem). This is a 0–1 scale that measures…
To what extent do public sector employees grant favors in exchange for bribes, kickbacks, or other material inducements, and how often do they steal, embezzle, or misappropriate public funds or other state resources for personal or family use?
Higher values represent worse corruption.
Main binary outcome: Disclosure of campaign donations (v2eldonate_ord
in VDem). This is an ordinal variable with these possible values:
- 0: No. There are no disclosure requirements.
- 1: Not really. There are some, possibly partial, disclosure requirements in place but they are not observed or enforced most of the time.
- 2: Ambiguous. There are disclosure requirements in place, but it is unclear to what extent they are observed or enforced.
- 3: Mostly. The disclosure requirements may not be fully comprehensive (some donations not covered), but most existing arrangements are observed and enforced.
- 4: Yes. There are comprehensive requirements and they are observed and enforced almost all the time.
For the sake of simplicity, we’ll collapse this into a binary variable. Countries have disclosure laws if they score a 3 or a 4; they don’t if they score a 0, 1, or 2.
Other continuous explanatory variables:

- Electoral democracy (v2x_polyarchy in VDem): a continuous variable measured from 0–1 with higher values representing greater achievement of democratic ideals
- Civil liberties (v2x_civlib in VDem): a continuous variable measured from 0–1 with higher values representing better respect for human rights and civil liberties
- GDP per capita (NY.GDP.PCAP.KD at the World Bank): GDP per capita in constant 2015 USD

Region: VDem provides multiple regional variables with varying specificity (19 different regions, 10 different regions, and 6 different regions). We’ll use the 6-region version (e_regionpol_6C) for simplicity here:
- 1: Eastern Europe and Central Asia (including Mongolia)
- 2: Latin America and the Caribbean
- 3: The Middle East and North Africa (including Israel and Turkey, excluding Cyprus)
- 4: Sub-Saharan Africa
- 5: Western Europe and North America (including Cyprus, Australia and New Zealand)
- 6: Asia and Pacific (excluding Australia and New Zealand)
# Get data from the World Bank's API
wdi_raw <- WDI(country = "all",
               indicator = c(population = "SP.POP.TOTL",
                             gdp_percapita = "NY.GDP.PCAP.KD"),
               start = 2000, end = 2020, extra = TRUE)
# Clean up the World Bank data
wdi_2020 <- wdi_raw |>
  filter(region != "Aggregates") |>
  filter(year == 2020) |>
  mutate(log_gdp_percapita = log(gdp_percapita)) |>
  select(-region, -status, -year, -country, -lastupdated, -lending)
# Get data from VDem and clean it up
vdem_2020 < vdem %>%
select(country_name, country_text_id, year, region = e_regionpol_6C,
disclose_donations_ord = v2eldonate_ord,
public_sector_corruption = v2x_pubcorr,
polyarchy = v2x_polyarchy, civil_liberties = v2x_civlib) %>%
filter(year == 2020) %>%
mutate(disclose_donations = disclose_donations_ord >= 3,
disclose_donations = ifelse(is.na(disclose_donations), FALSE, disclose_donations)) %>%
  # Scale these up so it's easier to talk about 1-unit changes
  mutate(across(c(public_sector_corruption, polyarchy, civil_liberties), ~ . * 100)) %>%
  mutate(region = factor(region,
                         labels = c("Eastern Europe and Central Asia",
                                    "Latin America and the Caribbean",
                                    "Middle East and North Africa",
                                    "Sub-Saharan Africa",
                                    "Western Europe and North America",
                                    "Asia and Pacific")))
# Combine World Bank and VDem data into a single dataset
corruption <- vdem_2020 |>
  left_join(wdi_2020, by = c("country_text_id" = "iso3c")) |>
  drop_na(gdp_percapita)
glimpse(corruption)
## Rows: 168
## Columns: 17
## $ country_name <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana",…
## $ country_text_id <chr> "MEX", "SUR", "SWE", "CHE", "GHA", "ZAF", "JPN", "MMR",…
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ region <fct> Latin America and the Caribbean, Latin America and the …
## $ disclose_donations_ord <dbl> 3, 1, 2, 0, 2, 1, 3, 2, 3, 2, 2, 0, 3, 3, 4, 3, 4, 2, 1…
## $ public_sector_corruption <dbl> 48.8, 24.8, 1.3, 1.4, 65.2, 57.1, 3.7, 36.8, 70.6, 71.2…
## $ polyarchy <dbl> 64.7, 76.1, 90.8, 89.4, 72.0, 70.3, 83.2, 43.6, 26.2, 4…
## $ civil_liberties <dbl> 71.2, 87.7, 96.9, 94.8, 90.4, 82.2, 92.8, 56.9, 43.0, 8…
## $ disclose_donations <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, T…
## $ iso2c <chr> "MX", "SR", "SE", "CH", "GH", "ZA", "JP", "MM", "RU", "…
## $ population <dbl> 1.29e+08, 5.87e+05, 1.04e+07, 8.64e+06, 3.11e+07, 5.93e…
## $ gdp_percapita <dbl> 8923, 7530, 51542, 85685, 2021, 5659, 34556, 1587, 9711…
## $ capital <chr> "Mexico City", "Paramaribo", "Stockholm", "Bern", "Accr…
## $ longitude <chr> "-99.1276", "-55.1679", "18.0645", "7.44821", "-0.20795…
## $ latitude <chr> "19.427", "5.8232", "59.3327", "46.948", "5.57045", "-2…
## $ income <chr> "Upper middle income", "Upper middle income", "High inc…
## $ log_gdp_percapita <dbl> 9.10, 8.93, 10.85, 11.36, 7.61, 8.64, 10.45, 7.37, 9.18…
Let’s start off by looking at the effect of civil liberties on public sector corruption by using a really simple model with one explanatory variable:
plot_corruption <- corruption |>
  mutate(highlight = civil_liberties == min(civil_liberties) |
           civil_liberties == max(civil_liberties))
ggplot(plot_corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
geom_point(aes(color = highlight)) +
stat_smooth(method = "lm", formula = y ~ x, size = 1, color = clrs[1]) +
geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
aes(label = country_name), seed = 1234) +
scale_color_manual(values = c("grey30", clrs[3]), guide = "none") +
labs(x = "Civil liberties index", y = "Public sector corruption index") +
theme_mfx()
We have a nice fitted OLS line here with uncertainty around it. What’s the marginal effect of civil liberties on public sector corruption? What kind of calculus and math do we need to do to find it? Not much, happily!
In general, we have a regression formula here that looks a lot like the stuff we were using before, only now the intercept is $\beta_0$ and the slope is $\beta_1$. If we use the power rule to find the first derivative of this equation, we’ll see that the slope of the entire line is $\beta_1$:

$$y = \beta_0 + \beta_1 x \qquad \frac{dy}{dx} = \beta_1$$
If we add actual coefficients from the model into the formula we can see that the coefficient for civil_liberties (−0.80) is indeed the marginal effect:

$$\frac{dy}{dx} = -0.80$$

The coefficient by itself is thus enough to tell us what the effect of moving civil liberties around is—it is the marginal effect of civil liberties on public sector corruption. Slide the civil liberties index up by 1 point and public sector corruption will be 0.80 points lower, on average.
Importantly, this is only the case because we’re using simple linear regression without any curvy parts. If your model is completely linear, without any polynomials or logs or interaction terms, and doesn’t use curvy regression families like logistic or beta regression, you can use individual coefficients as marginal effects.
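Here’s a little simulation to illustrate that property (invented data, not the corruption model): in a purely linear model, the nudge-and-subtract slope of the fitted line is the coefficient, no matter where you evaluate it.

```r
set.seed(1234)

# Simulate a purely linear relationship: y = 50 - 0.8x + noise
fake <- data.frame(x = runif(100, 0, 100))
fake$y <- 50 - 0.8 * fake$x + rnorm(100, sd = 5)

model_fake <- lm(y ~ x, data = fake)

# Slope of the fitted line at any x0, via a tiny nudge
h <- 0.001
slope_at <- function(x0) {
  (predict(model_fake, newdata = data.frame(x = x0 + h)) -
     predict(model_fake, newdata = data.frame(x = x0))) / h
}

slope_at(20) - coef(model_fake)["x"]  # essentially zero
slope_at(70) - coef(model_fake)["x"]  # essentially zero
```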
Let’s see what happens when we add curves. We’ll add a polynomial term, including both civil_liberties
and civil_liberties^2
so that we can capture the parabolic shape of the relationship between civil liberties and corruption:
ggplot(plot_corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
geom_point(aes(color = highlight)) +
stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1, color = clrs[2]) +
geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
aes(label = country_name), seed = 1234) +
scale_color_manual(values = c("grey30", clrs[3]), guide = "none") +
labs(x = "Civil liberties index", y = "Public sector corruption index") +
theme_mfx()
This is most likely not a great model fit in real life, but using the quadratic term here makes a neat curved line, so we’ll go with it for the sake of the example. But don’t, like, make any policy decisions based on this line.
When working with polynomials in regression, the coefficients appear and work a little differently:
model_sq <- lm(public_sector_corruption ~ civil_liberties + I(civil_liberties^2),
               data = corruption)
tidy(model_sq)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept)           41.9     11.6         3.60 0.000427
## 2 civil_liberties        1.58     0.419        3.77 0.000230
## 3 I(civil_liberties^2)  -0.0197   0.00341     -5.77 0.0000000382
We now have two coefficients for civil liberties: 1.58 and −0.0197. Importantly, we cannot use just one of these to talk about the marginal effect of changing civil liberties. A one-point increase in the civil liberties index is not associated with a 1.58-point increase or a 0.02-point decrease in corruption. The slope of the fitted line now comprises multiple moving parts: (1) the coefficient for the non-squared term, (2) the coefficient for the squared term, and (3) some value of civil liberties, since the slope isn’t the same across the whole line. The math shows us why and how.
We have terms for both $x$ and $x^2$ in our model. To find the derivative, we can use the power rule to get rid of the $x$ in the $\beta_1 x$ term ($\beta_1$), but the $x$ in the $\beta_2 x^2$ term doesn’t disappear ($2\beta_2 x$). The slope of the line thus depends on both the $\beta$s and the value of $x$:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 \qquad \frac{dy}{dx} = \beta_1 + 2\beta_2 x$$
Here’s what that looks like with the results of our civil liberties and corruption model:

$$\frac{dy}{dx} = 1.58 + (2 \times -0.0197 \times \text{civil liberties})$$

Because the actual slope depends on the value of civil liberties, we need to plug in different values to get the instantaneous slopes at each value. Let’s plug in 25, 55, and 80, for fun:
# Extract the two civil_liberties coefficients
civ_lib1 <- tidy(model_sq) |> filter(term == "civil_liberties") |> pull(estimate)
civ_lib2 <- tidy(model_sq) |> filter(term == "I(civil_liberties^2)") |> pull(estimate)
# Make a little function to do the math
civ_lib_slope <- function(x) civ_lib1 + (2 * civ_lib2 * x)
civ_lib_slope(c(25, 55, 80))
## [1]  0.594 -0.587 -1.572
We have three different slopes now: 0.59, −0.59, and −1.57 for civil liberties of 25, 55, and 80, respectively. We can plot these as tangent lines:
tangents <- model_sq |>
  augment(newdata = tibble(civil_liberties = c(25, 55, 80))) |>
  mutate(slope = civ_lib_slope(civil_liberties),
         intercept = find_intercept(civil_liberties, .fitted, slope)) |>
  mutate(nice_label = glue("Civil liberties: {civil_liberties}<br>",
                           "Fitted corruption: {nice_number(.fitted)}<br>",
                           "Slope: **{nice_number(slope)}**"))
ggplot(corruption, aes(x = civil_liberties, y = public_sector_corruption)) +
geom_point(color = "grey30") +
stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1, se = FALSE, color = clrs[4]) +
geom_abline(data = tangents, aes(slope = slope, intercept = intercept),
size = 0.5, color = clrs[2], linetype = "21") +
geom_point(data = tangents, aes(x = civil_liberties, y = .fitted), size = 4, shape = 18, color = clrs[2]) +
geom_richtext(data = tangents, aes(x = civil_liberties, y = .fitted, label = nice_label), nudge_y = 7) +
labs(x = "Civil liberties index", y = "Public sector corruption index") +
theme_mfx()
Doing the calculus by hand here is tedious though, especially once we start working with even more covariates in a model. Plus we don’t have any information about uncertainty, like standard errors and confidence intervals. There are official mathy ways to figure those out by hand, but who even wants to do that. Fortunately there are two different packages that let us find marginal slopes automatically, with important differences in their procedures, which we’ll explore in detail below. But before looking at their differences, let’s first see how they work.
First, we can use the marginaleffects() function from marginaleffects to see the slope (the dydx column here) at various levels of civil liberties. We’ll look at the mechanics of this function in more detail in the next section—for now we’ll just plug in our three values of civil liberties and see what happens. We’ll also set the eps argument: behind the scenes, marginaleffects() doesn’t actually do the by-hand calculus of piecing together first derivatives—instead, it calculates the fitted value of corruption when civil liberties is a value, calculates the fitted value of corruption when civil liberties is that same value plus a tiny bit more, and then subtracts them. The eps value controls that tiny amount. In this case, it’ll calculate the predictions for civil_liberties = 25 and civil_liberties = 25.001 and then find the slope of the tiny tangent line between those two points. It’s a neat little mathy trick to avoid calculus.
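To demystify that a bit, here’s the nudge-and-subtract logic done by hand on simulated data (made-up numbers with a known quadratic shape, standing in for the real model), compared against the power-rule answer:

```r
set.seed(1234)

# Simulated data with a known curvy relationship
fake <- data.frame(x = runif(200, 0, 100))
fake$y <- 40 + 1.6 * fake$x - 0.02 * fake$x^2 + rnorm(200, sd = 5)

model_fake <- lm(y ~ x + I(x^2), data = fake)
b <- coef(model_fake)

# marginaleffects()-style slope: predict at x0 and at x0 + eps, then subtract
eps <- 0.001
dydx_at <- function(x0) {
  (predict(model_fake, newdata = data.frame(x = x0 + eps)) -
     predict(model_fake, newdata = data.frame(x = x0))) / eps
}

# The numeric slope matches the analytic slope b1 + 2*b2*x0 almost exactly
unname(dydx_at(25))
unname(b[2] + 2 * b[3] * 25)
```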
model_sq |>
  marginaleffects(newdata = datagrid(civil_liberties = c(25, 55, 80)),
                  eps = 0.001)
##   rowid     type            term   dydx std.error statistic  p.value conf.low conf.high predicted predicted_hi predicted_lo public_sector_corruption civil_liberties   eps
## 1     1 response civil_liberties  0.594    0.2527      2.35 1.87e-02   0.0991     1.090      69.0         69.0         69.0                     45.8              25 0.001
## 2     2 response civil_liberties -0.587    0.0806     -7.28 3.27e-13  -0.7452    -0.429      69.1         69.1         69.1                     45.8              55 0.001
## 3     3 response civil_liberties -1.572    0.1509    -10.42 2.04e-25  -1.8676    -1.276      42.2         42.2         42.2                     45.8              80 0.001
Second, we can use the emtrends() function from emmeans to also see the slope (the civil_liberties.trend column here) at various levels of civil liberties. The syntax is different (note the delta.var argument instead of eps), but the results are essentially the same:
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties = c(25, 55, 80)),
           delta.var = 0.001)
##  civil_liberties civil_liberties.trend     SE  df lower.CL upper.CL
##               25                 0.594 0.2527 165    0.095    1.093
##               55                -0.587 0.0806 165   -0.746   -0.428
##               80                -1.572 0.1509 165   -1.870   -1.274
##
## Confidence level used: 0.95
Both marginaleffects() and emtrends() also helpfully provide uncertainty, with standard errors and confidence intervals, with a lot of super fancy math behind the scenes to make it all work. marginaleffects() provides p-values automatically; if you want p-values from emtrends() you need to wrap it in test():
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties = c(25, 55, 80)),
           delta.var = 0.001) |>
  test()
##  civil_liberties civil_liberties.trend     SE  df t.ratio p.value
##               25                 0.594 0.2527 165   2.350  0.0198
##               55                -0.587 0.0806 165  -7.280  <.0001
##               80                -1.572 0.1509 165 -10.420  <.0001
Another neat thing about these more automatic functions is that we can use them to create a marginal effects plot, placing the value of the slope on the y-axis rather than the fitted value of public corruption. marginaleffects helpfully has plot_cme() that will plot the values of dydx across the whole range of civil liberties automatically. Alternatively, if we want full control over the plot, we can use either marginaleffects() or emtrends() to create a data frame that we can plot ourselves with ggplot:
# Automatic plot from marginaleffects::plot_cme()
mfx_marginaleffects_auto <- plot_cme(model_sq,
effect = "civil_liberties",
condition = "civil_liberties") +
labs(x = "Civil liberties", y = "Marginal effect of civil liberties on public sector corruption",
subtitle = "Created automatically with marginaleffects::plot_cme()") +
theme_mfx()
# Piece all the geoms together manually with results from marginaleffects::marginaleffects()
mfx_marginaleffects <- model_sq |>
  marginaleffects(newdata = datagrid(civil_liberties =
                    seq(min(corruption$civil_liberties),
                        max(corruption$civil_liberties), 0.1)),
                  eps = 0.001) |>
ggplot(aes(x = civil_liberties, y = dydx)) +
geom_vline(xintercept = 42, color = clrs[3], size = 0.5, linetype = "24") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, fill = clrs[1]) +
geom_line(size = 1, color = clrs[1]) +
labs(x = "Civil liberties", y = "Marginal effect of civil liberties on public sector corruption",
subtitle = "Calculated with marginaleffects()") +
theme_mfx()
# Piece all the geoms together manually with results from emmeans::emtrends()
mfx_emtrends <- model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties",
           at = list(civil_liberties =
                       seq(min(corruption$civil_liberties),
                           max(corruption$civil_liberties), 0.1)),
           delta.var = 0.001) |>
  as_tibble() |>
ggplot(aes(x = civil_liberties, y = civil_liberties.trend)) +
geom_vline(xintercept = 42, color = clrs[3], size = 0.5, linetype = "24") +
geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.1, fill = clrs[1]) +
geom_line(size = 1, color = clrs[1]) +
labs(x = "Civil liberties", y = "Marginal effect of civil liberties on public sector corruption",
subtitle = "Calculated with emtrends()") +
theme_mfx()
mfx_marginaleffects_auto | mfx_marginaleffects | mfx_emtrends
This kind of plot is useful since it shows precisely how the effect changes across civil liberties. The slope is 0 at around 42, positive before that, and negative after that, which—assuming this is a good model and who even knows if that’s true—implies that countries with low levels of respect for civil liberties will see an increase in corruption as civil liberties increases, while countries with high respect for civil liberties will see a decrease in corruption as they improve their respect for human rights.
Finding marginal effects for simple lines and curves with calculus is fairly easy since there’s no uncertainty involved. Finding marginal effects for fitted lines from a regression model, on the other hand, is more complicated because uncertainty abounds. The estimated partial slopes all have standard errors and measures of statistical significance attached to them. The slope of civil liberties at 55 is −0.59, but it could be higher and it could be lower. Could it even possibly be zero? Maybe! (But most likely not; the p-value that we saw above is less than 0.001, so there’s only a sliver of a chance of seeing a slope like −0.59 in a world where it is actually 0-ish.)
We deal with the uncertainty of these marginal effects by taking averages, which is why we talk about “average marginal effects” when interpreting these effects. So far, marginaleffects::marginaleffects() and emmeans::emtrends() have given identical results. But behind the scenes, these packages take two different approaches to calculating these marginal averages. The difference is very subtle, but incredibly important.
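A toy sketch (my own illustration, not from the original analysis) shows why the order of averaging matters for nonlinear curves: the average of the slopes at several points generally isn't the slope at the average point. Using base R's plogis() logistic curve:

```r
# Finite-difference slope of the logistic curve at a point
slope <- function(x, h = 0.001) (plogis(x + h) - plogis(x)) / h

x <- c(-4, 0, 4)
mean(slope(x))  # "slope at each x, then average" (roughly 0.095 here)
slope(mean(x))  # "average x first, then one slope" (roughly 0.25 here)
```

With the quadratic OLS model above, the two orders happen to agree because the derivative of a quadratic is linear in x; once the model itself is nonlinear, like the logistic regression later on, they diverge.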
Let’s look at how these two packages calculate their marginal effects by default.
By default, marginaleffects calculates the average marginal effect (AME) for its partial slopes/coefficients. To do this, it follows a specific process of averaging:
It first plugs each row of the original dataset into the model and generates predictions for each row. It then uses fancy math (i.e. adding 0.001) to calculate the instantaneous slope for each row and stores each individual slope in the dydx column here:
mfx_sq <- marginaleffects(model_sq)
head(mfx_sq)
## rowid type term dydx std.error statistic p.value conf.low conf.high predicted predicted_hi predicted_lo public_sector_corruption civil_liberties civil_liberties.1 eps
## 1 1 response civil_liberties -1.23 0.102 -12.02 2.91e-33 -1.43 -1.03 54.46 54.45 54.46 48.8 71.2 5069 0.00857
## 2 2 response civil_liberties -1.88 0.199 -9.43 3.94e-21 -2.26 -1.49 28.88 28.87 28.88 24.8 87.7 7691 0.00857
## 3 3 response civil_liberties -2.24 0.258 -8.66 4.71e-18 -2.74 -1.73 9.97 9.95 9.97 1.3 96.9 9390 0.00857
## 4 4 response civil_liberties -2.16 0.245 -8.81 1.27e-18 -2.63 -1.68 14.58 14.56 14.58 1.4 94.8 8987 0.00857
## 5 5 response civil_liberties -1.98 0.216 -9.17 4.70e-20 -2.41 -1.56 23.68 23.66 23.68 65.2 90.4 8172 0.00857
## 6 6 response civil_liberties -1.66 0.164 -10.10 5.73e-24 -1.98 -1.34 38.60 38.59 38.60 57.1 82.2 6757 0.00857
It finally calculates the average of the dydx column. We can do that ourselves:
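(The code for this step isn't included in this excerpt; the point estimate is just the mean of that column, though getting correct standard errors requires the delta-method math that summary() and tidy() handle for us.)

```r
# Collapse the row-by-row instantaneous slopes into a single AME (point estimate only)
mean(mfx_sq$dydx)
```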
Or we can feed a marginaleffects object to summary() or tidy(), which will calculate the correct uncertainty statistics, like the standard errors:
summary(mfx_sq)
## Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 civil_liberties -1.17 0.0948 -12.3 <2e-16 -1.35 -0.98
##
## Model type: lm
## Prediction type: response
Note that the average marginal effect here isn’t the same as what we saw before when we set civil liberties to different values. In this case, the effect is averaged across the whole range of civil liberties—one single grand average mean. It shows that in general, the overall average slope of the fitted line is −1.17.
Don’t worry about the number too much here—we’re just exploring the underlying process of calculating this average marginal effect. In general, as the image shows above, for average marginal effects, we take the full original data, feed it to the model, generate fitted values for each original row, and then collapse the results into a single value.
The main advantage of doing this is that each dydx prediction uses values that exist in the actual data. The first dydx slope estimate is for Mexico in 2020 and is based on Mexico’s actual value of civil_liberties (and any other covariates if we had included any others in the model). It’s thus more reflective of reality.
A different approach for this averaging is to calculate the marginal effect at the mean, or MEM. This is what the emmeans package does by default. (The emmeans package actually calculates two average things: “marginal effects at the means” (MEM), or average slopes, using emtrends(), and “estimated marginal means” (EMM), or average predictions, using emmeans(). It’s named after the second of these, hence the name emmeans.)
To do this, we follow a slightly different process of averaging:
First, we calculate the average value of each of the covariates in the model (in this case, just civil_liberties):
avg_civ_lib <- mean(corruption$civil_liberties)
avg_civ_lib
## [1] 69.7
We then plug that average (and that average plus 0.001) into the model and generate fitted values:
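(The code that builds civ_lib_fitted isn't shown in this excerpt; a minimal reconstruction with broom::augment(), assuming model_sq and avg_civ_lib from above, would look something like this:)

```r
library(tibble)
library(broom)

# Fitted values at the mean of civil liberties and at the mean + 0.001
civ_lib_fitted <- augment(
  model_sq,
  newdata = tibble(civil_liberties = c(avg_civ_lib, avg_civ_lib + 0.001))
)
civ_lib_fitted
```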
Because of rounding (and because the values are so tiny), this looks like the two rows are identical, but they’re not—the second one really is 0.001 more than 69.682.
We then subtract the two and divide by 0.001 to get the final marginal effect at the mean:
(civ_lib_fitted[2,2] - civ_lib_fitted[1,2]) / 0.001
## .fitted
## 1 -1.17
That doesn’t give us any standard errors or uncertainty or anything, so it’s better to use emtrends() or marginaleffects(). emtrends() calculates this MEM automatically:
model_sq |>
  emtrends(~ civil_liberties, var = "civil_liberties", delta.var = 0.001)
## civil_liberties civil_liberties.trend SE df lower.CL upper.CL
## 69.7 -1.17 0.0948 165 -1.35 -0.978
##
## Confidence level used: 0.95
We can also calculate the MEM with marginaleffects() if we include the newdata = "mean" argument, which will automatically shrink the original data down into average or typical values:
model_sq |>
  marginaleffects(newdata = "mean") |>
  summary()
## Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 civil_liberties -1.17 0.0948 -12.3 <2e-16 -1.35 -0.98
##
## Model type: lm
## Prediction type: response
The disadvantage of this approach is that no actual country has a civil_liberties score of exactly 69.682. If we had other covariates in the model, no country would have exactly the average of every variable. The marginal effect is thus calculated based on a hypothetical country that might not possibly exist in real life.
So far, comparing average marginal effects (AME) with marginal effects at the mean (MEM) hasn’t been that useful, since both marginaleffects() and emtrends() provided nearly identical results with our simple model with civil liberties squared. That’s because nothing that strange is going on in the model—there are no additional explanatory variables, no interactions or logs, and we’re using OLS and not anything fancy like logistic regression or beta regression.
Things change once we leave the land of OLS.
Let’s make a new model that predicts if a country has campaign finance disclosure laws based on public sector corruption. Disclosure laws is a binary outcome, so we’ll use logistic regression to constrain the fitted values and predictions to between 0 and 1.
plot_corruption_logit <- corruption |>
  mutate(highlight = public_sector_corruption == min(public_sector_corruption) |
           public_sector_corruption == max(public_sector_corruption))
ggplot(plot_corruption_logit,
aes(x = public_sector_corruption, y = as.numeric(disclose_donations))) +
geom_point(aes(color = highlight)) +
geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit")),
color = clrs[2]) +
geom_label(data = slice(filter(plot_corruption_logit, highlight == TRUE), 1),
aes(label = country_name), nudge_y = 0.06, hjust = 1) +
geom_label(data = slice(filter(plot_corruption_logit, highlight == TRUE), 2),
aes(label = country_name), nudge_y = 0.06, hjust = 0) +
scale_color_manual(values = c("grey30", clrs[3]), guide = "none") +
labs(x = "Public sector corruption",
y = "Presence or absence of\ncampaign finance disclosure laws\n(Line shows predicted probability)") +
theme_mfx()
Even without any squared terms, we’re already in nonlinear land. We can build a model and explore this relationship:
model_logit <- glm(
disclose_donations ~ public_sector_corruption,
family = binomial(link = "logit"),
data = corruption
)
tidy(model_logit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.98 0.388 5.09 3.51e- 7
## 2 public_sector_corruption -0.0678 0.00991 -6.84 7.85e-12
The coefficients here are on a different scale and are measured in log odds units (or logits), not probabilities or percentage points. That means we can’t use those coefficients directly. We can’t say things like “a one-unit increase in public sector corruption is associated with a −0.068 percentage point decrease in the probability of having a disclosure law.” That’s wrong! We have to convert those logit-scale coefficients to a probability scale instead. We can do this mathematically by combining both the intercept and the coefficient using plogis(intercept + coefficient) - plogis(intercept), but that’s generally not recommended, especially when there are other coefficients (see this section on logistic regression for more details). Additionally, manually combining intercepts and coefficients won’t give us standard errors or any other kind of uncertainty.
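As a rough illustration using the rounded coefficients from tidy(model_logit) above (rounding loses some precision, and again there's no uncertainty attached to this):

```r
# Convert the logit-scale coefficient to a probability-scale change by hand
b0 <- 1.98     # intercept, in log odds
b1 <- -0.0678  # public_sector_corruption coefficient, in log odds
plogis(b0 + b1) - plogis(b0)  # change in probability for a one-unit increase from 0
```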
Instead, we can calculate the average slope of the logistic regression fit using either marginaleffects() or emtrends().
First we’ll use marginaleffects(). Remember that it calculates the average marginal effect (AME) by plugging each row of the original data into the model, generating predictions and instantaneous slopes for each row, and then averaging the dydx column. Each row contains actual observed data, so the predictions arguably reflect variation in reality. marginaleffects() helpfully converts the AME into percentage points (note that it says “Prediction type: response”), so we can interpret the value directly.
model_logit |>
  marginaleffects() |>
  summary()
## Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 public_sector_corruption -0.00846 0.000261 -32.4 <2e-16 -0.00897 -0.00795
##
## Model type: glm
## Prediction type: response
The average marginal effect for public sector corruption is −0.0085, which means that on average, a one-point increase in the public sector corruption index (i.e. as corruption gets worse) is associated with a −0.85 percentage point decrease in the probability of a country having a disclosure law.
Next we’ll use emtrends(), which calculates the marginal effect at the mean (MEM) by averaging all the model covariates first, plugging those averages into the model, and generating a single instantaneous slope. The values that get plugged into the model won’t necessarily reflect reality—especially once more covariates are involved, which we’ll see later. By default emtrends() returns the results on the logit scale, but we can convert them to the response/percentage point scale by adding the regrid = "response" argument:
model_logit |>
  emtrends(~ public_sector_corruption,
           var = "public_sector_corruption",
           regrid = "response")
## public_sector_corruption public_sector_corruption.trend SE df asymp.LCL asymp.UCL
## 45.8 -0.0125 0.0017 Inf -0.0158 -0.00916
##
## Confidence level used: 0.95
# marginaleffects() will show the same MEM result with `newdata = "mean"`
# marginaleffects(model_logit, newdata = "mean") |> summary()
When we plug the average public sector corruption (45.82) into the model, we get an MEM of −0.0125, which means that on average, a one-point increase in the public sector corruption index is associated with a −1.25 percentage point decrease in the probability of a country having a disclosure law. That’s different (and bigger!) than the AME we found with marginaleffects()!
Let’s plot these marginal effects and their uncertainty to see how much they differ:
# Get tidied results from marginaleffects()
plot_ame <- model_logit |>
  marginaleffects() |>
  tidy()
# Get tidied results from emtrends()
plot_mem <- model_logit |>
  emtrends(~ public_sector_corruption,
           var = "public_sector_corruption",
           regrid = "response") |>
  tidy(conf.int = TRUE) |>
  rename(estimate = public_sector_corruption.trend)
# Combine the two tidy data frames for plotting
plot_effects <- bind_rows("AME" = plot_ame, "MEM" = plot_mem, .id = "type") |>
  mutate(nice_slope = nice_number(estimate * 100))
ggplot(plot_effects, aes(x = estimate * 100, y = fct_rev(type), color = type)) +
geom_vline(xintercept = 0, size = 0.5, linetype = "24", color = clrs[1]) +
geom_pointrange(aes(xmin = conf.low * 100, xmax = conf.high * 100)) +
geom_label(aes(label = nice_slope), nudge_y = 0.3) +
labs(x = "Marginal effect (percentage points)", y = NULL) +
scale_color_manual(values = c(clrs[2], clrs[5]), guide = "none") +
theme_mfx()
That’s fascinating! The confidence interval around the AME is really small compared to the MEM, likely because the AME estimate comes from the average of 168 values, while the MEM is the prediction of a single value. Additionally, while both estimates hover around a 1 percentage point decrease, the AME is larger than −1 while the MEM is smaller.
For fun, let’s make a super fancy logistic regression model with a quadratic term and an interaction. We’ll compare the AME and MEM for public sector corruption again. This is where either marginaleffects() or emtrends() is incredibly helpful—correctly combining all the necessary coefficients, given that corruption is both squared and interacted, and given that there are other variables to worry about, would be really hard.
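(The code defining model_logit_fancy isn't included in this excerpt. Based on the description and the terms that appear in the output below, i.e. corruption squared and interacted with region, plus polyarchy and logged GDP per capita, a plausible reconstruction is:)

```r
# Hypothetical reconstruction; the original specification may differ in details
model_logit_fancy <- glm(
  disclose_donations ~ public_sector_corruption + I(public_sector_corruption^2) +
    public_sector_corruption * region +
    polyarchy + log_gdp_percapita,
  family = binomial(link = "logit"),
  data = corruption
)
```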
Here are the average marginal effects (AME) (again, each original row is plugged into the model, a slope is calculated for each, and then they’re all averaged together):
model_logit_fancy |>
  marginaleffects() |>
  summary()
## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 public_sector_corruption dY/dX -0.00653 0.00192 -3.408 7e-04 -0.01029 -0.00278
## 2 polyarchy dY/dX 0.00226 0.00172 1.316 0.188 -0.00111 0.00564
## 3 log_gdp_percapita dY/dX 0.00960 0.03782 0.254 0.800 -0.06453 0.08373
## 4 region Latin America and the Caribbean - Eastern Europe and Central Asia -0.26174 0.10388 -2.520 0.012 -0.46535 -0.05813
## 5 region Middle East and North Africa - Eastern Europe and Central Asia -0.20647 0.10835 -1.906 0.057 -0.41884 0.00589
## 6 region Sub-Saharan Africa - Eastern Europe and Central Asia -0.24986 0.11533 -2.167 0.030 -0.47590 -0.02383
## 7 region Western Europe and North America - Eastern Europe and Central Asia -0.29109 0.09726 -2.993 0.003 -0.48171 -0.10046
## 8 region Asia and Pacific - Eastern Europe and Central Asia -0.20418 0.09534 -2.142 0.032 -0.39103 -0.01732
##
## Model type: glm
## Prediction type: response
And here are the marginal effects at the mean (MEM) (again, the average values for each covariate are plugged into the model). Using emtrends() results in a note about interactions, so we’ll use marginaleffects(..., newdata = "mean") instead:
model_logit_fancy |>
  emtrends(~ public_sector_corruption,
           var = "public_sector_corruption",
           regrid = "response")
## NOTE: Results may be misleading due to involvement in interactions
## public_sector_corruption public_sector_corruption.trend SE df asymp.LCL asymp.UCL
## 45.8 -0.00955 0.00301 Inf -0.0155 -0.00366
##
## Results are averaged over the levels of: region
## Confidence level used: 0.95
# This uses marginaleffects() to find the MEM instead
model_logit_fancy |>
  marginaleffects(newdata = "mean") |>
  summary()
## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 public_sector_corruption dY/dX -0.01004 0.00634 -1.584 0.113 -0.0225 0.00238
## 2 polyarchy dY/dX 0.00217 0.00223 0.972 0.331 -0.0022 0.00654
## 3 log_gdp_percapita dY/dX 0.00919 0.03940 0.233 0.816 -0.0680 0.08641
## 4 region Latin America and the Caribbean - Eastern Europe and Central Asia -0.37448 0.17046 -2.197 0.028 -0.7086 -0.04039
## 5 region Middle East and North Africa - Eastern Europe and Central Asia -0.57907 0.21821 -2.654 0.008 -1.0068 -0.15139
## 6 region Sub-Saharan Africa - Eastern Europe and Central Asia -0.56591 0.17710 -3.195 0.001 -0.9130 -0.21880
## 7 region Western Europe and North America - Eastern Europe and Central Asia -0.65700 0.17381 -3.780 2e-04 -0.9977 -0.31634
## 8 region Asia and Pacific - Eastern Europe and Central Asia -0.53122 0.20644 -2.573 0.010 -0.9358 -0.12660
##
## Model type: glm
## Prediction type: response
Now that we’re working with multiple covariates, we have instantaneous marginal effects for each regression term, which is neat. We only care about corruption here, so let’s extract the slopes and plot them:
plot_ame_fancy <- model_logit_fancy |>
  marginaleffects() |>
  tidy()
plot_mem_fancy <- model_logit_fancy |>
  marginaleffects(newdata = "mean") |>
  tidy()
# Combine the two tidy data frames for plotting
plot_effects <- bind_rows("AME" = plot_ame_fancy, "MEM" = plot_mem_fancy, .id = "type") |>
  filter(term == "public_sector_corruption") |>
  mutate(nice_slope = nice_number(estimate * 100))
ggplot(plot_effects, aes(x = estimate * 100, y = fct_rev(type), color = type)) +
geom_vline(xintercept = 0, size = 0.5, linetype = "24", color = clrs[1]) +
geom_pointrange(aes(xmin = conf.low * 100, xmax = conf.high * 100)) +
geom_label(aes(label = nice_slope), nudge_y = 0.3) +
labs(x = "Marginal effect (percentage points)", y = NULL) +
scale_color_manual(values = c(clrs[2], clrs[5]), guide = "none") +
theme_mfx()
Yikes! The AME is statistically significant (p < 0.001) with a narrower confidence interval, but the MEM includes zero in its confidence interval and isn’t significant (p = 0.113).
The choice of marginal effect averaging thus matters a lot!
To make life even more exciting, we’re not limited to just average marginal effects (AMEs) or marginal effects at the mean (MEMs). Additionally, if we think back to the slider/switch/mixing board analogy, all we’ve really done so far with our logistic regression model is move one slider (public_sector_corruption) up and down. What happens if we move other switches and sliders at the same time? (i.e. the marginal effect of corruption at specific values of corruption, or across different regions, or at different levels of GDP per capita and polyarchy)
We can use both marginaleffects() and emtrends()/emmeans() to play with our model’s full mixing board. We’ll continue to use the logistic regression model as an example since it’s sensitive to the order of averaging.
If we have categorical covariates in our model like region, we can find the average marginal effect (AME) of continuous predictors across those different groups. This is fairly straightforward when working with marginaleffects() because of its approach to averaging. Remember that with the AME, each original row gets its own fitted value and its own individual slope, which we can then average and collapse into a single row. Group characteristics like region are maintained after calculating predictions, so we can calculate group averages of the individual slopes. This outlines the process:
Because we’re working with the AME, we have a dydx column with instantaneous slopes for each row in the original data:
# We'll specify variables = "public_sector_corruption" here to filter the
# marginal effects results. If we don't we'll get dozens of separate marginal
# effects later when using summary() or tidy(), for each of the coefficients,
# interactions, and cross-region contrasts
mfx_logit_fancy <- model_logit_fancy |>
  marginaleffects(variables = "public_sector_corruption")
# Original data frame + estimated dydx for each row
head(mfx_logit_fancy)
## rowid type term dydx std.error statistic p.value conf.low conf.high predicted predicted_hi predicted_lo disclose_donations public_sector_corruption public_sector_corruption.1 polyarchy log_gdp_percapita region eps
## 1 1 response public_sector_corruption -0.008825 0.00768 -1.149 0.251 -0.02388 0.00623 0.3487 0.3486 0.3487 TRUE 48.8 2381.44 64.7 9.10 Latin America and the Caribbean 0.00966
## 2 2 response public_sector_corruption -0.000897 0.00790 -0.114 0.910 -0.01637 0.01458 0.5280 0.5279 0.5280 FALSE 24.8 615.04 76.1 8.93 Latin America and the Caribbean 0.00966
## 3 3 response public_sector_corruption -0.006759 0.00510 -1.325 0.185 -0.01676 0.00324 0.9045 0.9045 0.9045 FALSE 1.3 1.69 90.8 10.85 Western Europe and North America 0.00966
## 4 4 response public_sector_corruption -0.006727 0.00546 -1.232 0.218 -0.01743 0.00397 0.9052 0.9052 0.9052 FALSE 1.4 1.96 89.4 11.36 Western Europe and North America 0.00966
## 5 5 response public_sector_corruption -0.002460 0.00322 -0.764 0.445 -0.00877 0.00385 0.0198 0.0197 0.0198 FALSE 65.2 4251.04 72.0 7.61 Sub-Saharan Africa 0.00966
## 6 6 response public_sector_corruption -0.005864 0.00556 -1.055 0.291 -0.01676 0.00503 0.0538 0.0538 0.0538 FALSE 57.1 3260.41 70.3 8.64 Sub-Saharan Africa 0.00966
All the original columns are still there, which means we can collapse the results however we want. For instance, here’s the average marginal effect across each region:
mfx_logit_fancy |>
  group_by(region) |>
  summarize(region_ame = mean(dydx))
## # A tibble: 6 × 2
## region region_ame
## <fct> <dbl>
## 1 Eastern Europe and Central Asia -0.00751
## 2 Latin America and the Caribbean -0.00326
## 3 Middle East and North Africa -0.00629
## 4 Sub-Saharan Africa -0.00435
## 5 Western Europe and North America -0.0112
## 6 Asia and Pacific -0.00830
We can also use summarizing methods built in to marginaleffects by using the by argument in marginaleffects(). This is the better option, since it does some tricky standard error calculations behind the scenes:
model_logit_fancy |>
  marginaleffects(variables = "public_sector_corruption",
                  by = "region")
## type term contrast dydx std.error statistic p.value conf.low conf.high predicted predicted_hi predicted_lo region
## 1 response public_sector_corruption mean(dY/dX) -0.00326 0.00376 -0.865 0.38693 -0.01063 0.00412 3.49e-01 3.49e-01 3.49e-01 Latin America and the Caribbean
## 2 response public_sector_corruption mean(dY/dX) -0.01123 0.00911 -1.233 0.21745 -0.02908 0.00662 9.05e-01 9.04e-01 9.05e-01 Western Europe and North America
## 3 response public_sector_corruption mean(dY/dX) -0.00435 0.00160 -2.722 0.00649 -0.00748 -0.00122 1.98e-02 1.97e-02 1.98e-02 Sub-Saharan Africa
## 4 response public_sector_corruption mean(dY/dX) -0.00830 0.00285 -2.916 0.00355 -0.01389 -0.00272 9.11e-01 9.11e-01 9.11e-01 Asia and Pacific
## 5 response public_sector_corruption mean(dY/dX) -0.00751 0.00248 -3.029 0.00245 -0.01236 -0.00265 1.86e-01 1.86e-01 1.86e-01 Eastern Europe and Central Asia
## 6 response public_sector_corruption mean(dY/dX) -0.00629 0.00194 -3.245 0.00117 -0.01009 -0.00249 9.03e-05 9.02e-05 9.03e-05 Middle East and North Africa
These are on the percentage point scale, not the logit scale, so we can interpret them directly. In Western Europe, the AME of corruption is −0.0112, so a one-point increase in the public sector corruption index there is associated with a −1.12 percentage point decrease in the probability of having a campaign finance disclosure law, on average (though it’s not actually significant (p = 0.217)). In the Middle East, on the other hand, corruption seems to matter less for disclosure laws—an increase in the corruption index there is associated with a −0.63 percentage point decrease in the probability of having a law, on average (and that is significant (p = 0.001)).
We can use emtrends() to get region-specific slopes, but we’ll get different results because of the order of averaging: emmeans creates averages and then plugs them in; marginaleffects plugs all the values in and then creates averages:
model_logit_fancy |>
  emtrends(~ public_sector_corruption + region,
           var = "public_sector_corruption", regrid = "response")
## public_sector_corruption region public_sector_corruption.trend SE df asymp.LCL asymp.UCL
## 45.8 Eastern Europe and Central Asia -0.01119 0.00554 Inf -0.0221 -0.00033
## 45.8 Latin America and the Caribbean -0.00734 0.00611 Inf -0.0193 0.00463
## 45.8 Middle East and North Africa -0.01172 0.01105 Inf -0.0334 0.00993
## 45.8 Sub-Saharan Africa -0.01001 0.00607 Inf -0.0219 0.00188
## 45.8 Western Europe and North America -0.00335 0.00769 Inf -0.0184 0.01172
## 45.8 Asia and Pacific -0.01371 0.00667 Inf -0.0268 -0.00063
##
## Confidence level used: 0.95
We can replicate the results from emtrends() with marginaleffects() if we plug in average or representative values (more on that in the next section), since that follows the same averaging order as emmeans (i.e. plugging averages into the model):
model_logit_fancy |>
  marginaleffects(variables = "public_sector_corruption",
                  newdata = datagrid(region = levels(corruption$region)),
                  by = "region") |>
  summary()
## Term Contrast region Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 public_sector_corruption mean(dY/dX) Eastern Europe and Central Asia -0.01117 0.00573 -1.950 0.05 -0.0224 5.47e-05
## 2 public_sector_corruption mean(dY/dX) Latin America and the Caribbean -0.00733 0.00692 -1.059 0.29 -0.0209 6.24e-03
## 3 public_sector_corruption mean(dY/dX) Middle East and North Africa -0.01177 0.01110 -1.060 0.29 -0.0335 9.98e-03
## 4 public_sector_corruption mean(dY/dX) Sub-Saharan Africa -0.01004 0.00634 -1.584 0.11 -0.0225 2.38e-03
## 5 public_sector_corruption mean(dY/dX) Western Europe and North America -0.00336 0.00776 -0.434 0.66 -0.0186 1.18e-02
## 6 public_sector_corruption mean(dY/dX) Asia and Pacific -0.01375 0.00702 -1.959 0.05 -0.0275 8.69e-06
##
## Model type: glm
## Prediction type: response
If we want to unlock the full potential of our regression mixing board, we can feed the model any values we want. In general, we’ll (1) make a little dataset with covariate values set to either specific values that we care about, or typical or average values, (2) plug that little dataset into the model and get fitted values, and (3) work with the results. There are a bunch of different names for this little fake dataset like “data grid” and “reference grid”, but they’re all the same idea. Here’s an overview of the approach:
Before plugging anything in, it’s helpful to look at different ways of creating data grids with R. For all these examples, we’ll make a dataset with public sector corruption set to 20 and 80 across Western Europe, Latin America, and the Middle East, with all other variables in the model set to their means. We’ll make a little list of these regions to save typing time:
regions_to_use <- c("Western Europe and North America",
"Latin America and the Caribbean",
"Middle East and North Africa")
First, we can do it all manually with the expand_grid() function from tidyr (or expand.grid() from base R). This creates a data frame from all combinations of the vectors and single values we feed it.
expand_grid(public_sector_corruption = c(20, 80),
region = regions_to_use,
polyarchy = mean(corruption$polyarchy),
log_gdp_percapita = mean(corruption$log_gdp_percapita))
## # A tibble: 6 × 4
## public_sector_corruption region polyarchy log_gdp_percapita
## <dbl> <chr> <dbl> <dbl>
## 1 20 Western Europe and North America 52.8 8.57
## 2 20 Latin America and the Caribbean 52.8 8.57
## 3 20 Middle East and North Africa 52.8 8.57
## 4 80 Western Europe and North America 52.8 8.57
## 5 80 Latin America and the Caribbean 52.8 8.57
## 6 80 Middle East and North Africa 52.8 8.57
A disadvantage of using expand_grid() like this is that the averages we calculated aren’t necessarily the same averages of the data that gets used in the model. If any rows are dropped in the model because of missing values, that won’t be reflected here. We could get around that by doing model.frame(model_logit_fancy)$polyarchy, but that’s starting to get unwieldy. Instead, we can use a function that takes information about the model into account.
Second, we can use data_grid() from modelr, which is part of the really neat tidyverse ecosystem. An advantage of doing this is that it will handle the typical value part automatically—it will calculate the mean for continuous predictors and the mode for categorical predictors.
modelr::data_grid(corruption,
public_sector_corruption = c(20, 80),
region = regions_to_use,
.model = model_logit_fancy)
## # A tibble: 6 × 4
## public_sector_corruption region polyarchy log_gdp_percapita
## <dbl> <chr> <dbl> <dbl>
## 1 20 Latin America and the Caribbean 54.2 8.52
## 2 20 Middle East and North Africa 54.2 8.52
## 3 20 Western Europe and North America 54.2 8.52
## 4 80 Latin America and the Caribbean 54.2 8.52
## 5 80 Middle East and North Africa 54.2 8.52
## 6 80 Western Europe and North America 54.2 8.52
Third, we can use marginaleffects’s datagrid(), which will also calculate typical values for any covariates we don’t specify:
datagrid(model = model_logit_fancy,
public_sector_corruption = c(20, 80),
region = regions_to_use)
## disclose_donations polyarchy log_gdp_percapita public_sector_corruption region
## 1 FALSE 52.8 8.57 20 Western Europe and North America
## 2 FALSE 52.8 8.57 20 Latin America and the Caribbean
## 3 FALSE 52.8 8.57 20 Middle East and North Africa
## 4 FALSE 52.8 8.57 80 Western Europe and North America
## 5 FALSE 52.8 8.57 80 Latin America and the Caribbean
## 6 FALSE 52.8 8.57 80 Middle East and North Africa
And finally, we can use emmeans’s ref_grid(), which will also automatically create typical values. This doesn’t return a data frame—it’s some sort of special ref_grid object, but all the important information is still there:
ref_grid(model_logit_fancy,
at = list(public_sector_corruption = c(20, 80),
region = regions_to_use))
## 'emmGrid' object with variables:
## public_sector_corruption = 20, 80
## polyarchy = 52.79
## log_gdp_percapita = 8.5674
## region = Western Europe and North America, Latin America and the Caribbean, Middle East and North Africa
## Transformation: "logit"
Now that we have a hypothetical data grid of sliders and switches set to specific values, we can plug it into the model and generate fitted values. Importantly, doing this provides us with results that are analogous to the marginal effects at the mean (MEM) that we found earlier, and not the average marginal effect (AME), since we’re not feeding the entire original dataset to the model. None of these hypothetical rows exist in real life—there is no country with any of these exact combinations of corruption, polyarchy/democracy, GDP per capita, or region.
model_logit_fancy |>
marginaleffects(variables = "public_sector_corruption",
newdata = datagrid(public_sector_corruption = c(20, 80),
region = regions_to_use))
## rowid type term dydx std.error statistic p.value conf.low conf.high predicted predicted_hi predicted_lo disclose_donations polyarchy log_gdp_percapita public_sector_corruption region eps
## 1 1 response public_sector_corruption -2.49e-02 1.57e-02 -1.5926 0.111 -0.055603 0.005751 3.81e-01 3.81e-01 3.81e-01 FALSE 52.8 8.57 20 Western Europe and North America 0.00966
## 2 2 response public_sector_corruption 8.27e-04 8.56e-03 0.0966 0.923 -0.015956 0.017610 3.97e-01 3.97e-01 3.97e-01 FALSE 52.8 8.57 20 Latin America and the Caribbean 0.00966
## 3 3 response public_sector_corruption -2.04e-02 1.43e-02 -1.4228 0.155 -0.048514 0.007704 6.58e-01 6.58e-01 6.58e-01 FALSE 52.8 8.57 20 Middle East and North Africa 0.00966
## 4 4 response public_sector_corruption -1.49e-05 9.71e-05 -0.1534 0.878 -0.000205 0.000175 7.69e-05 7.67e-05 7.69e-05 FALSE 52.8 8.57 80 Western Europe and North America 0.00966
## 5 5 response public_sector_corruption -4.36e-03 3.65e-03 -1.1943 0.232 -0.011528 0.002798 5.45e-02 5.45e-02 5.45e-02 FALSE 52.8 8.57 80 Latin America and the Caribbean 0.00966
## 6 6 response public_sector_corruption -1.06e-04 4.47e-04 -0.2368 0.813 -0.000982 0.000770 5.93e-04 5.92e-04 5.93e-04 FALSE 52.8 8.57 80 Middle East and North Africa 0.00966
model_logit_fancy |>
emtrends(~ public_sector_corruption + region, var = "public_sector_corruption",
at = list(public_sector_corruption = c(20, 80),
region = regions_to_use),
regrid = "response", delta.var = 0.001)
## public_sector_corruption region public_sector_corruption.trend SE df asymp.LCL asymp.UCL
## 20 Western Europe and North America -0.02493 0.01560 Inf -0.0555 0.00565
## 80 Western Europe and North America -0.00001 0.00009 Inf -0.0002 0.00015
## 20 Latin America and the Caribbean 0.00083 0.00860 Inf -0.0160 0.01768
## 80 Latin America and the Caribbean -0.00437 0.00353 Inf -0.0113 0.00256
## 20 Middle East and North Africa -0.02040 0.01433 Inf -0.0485 0.00768
## 80 Middle East and North Africa -0.00011 0.00038 Inf -0.0009 0.00065
##
## Confidence level used: 0.95
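For intuition about where these numbers come from, the delta.var = 0.001 argument hints that the slope is computed numerically: nudge the focal variable a tiny amount, re-predict, and divide the change in predicted probability by the step size. Here's a minimal sketch of that idea with a toy model (synthetic data and made-up names, not the post's model):

```r
# Finite-difference slope on the response scale (toy model, synthetic data)
set.seed(42)
d <- data.frame(x = runif(150, 0, 100), z = rnorm(150))
d$y <- rbinom(150, 1, plogis(0.5 - 0.05 * d$x + 0.3 * d$z))
m <- glm(y ~ x + z, data = d, family = binomial())

# Slope of the predicted probability at x = 20, z = 0, via a tiny step
eps  <- 0.001
p_lo <- predict(m, newdata = data.frame(x = 20,       z = 0), type = "response")
p_hi <- predict(m, newdata = data.frame(x = 20 + eps, z = 0), type = "response")
(slope_at_20 <- unname((p_hi - p_lo) / eps))
```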
We have a ton of marginal effects here, but this is all starting to get really complicated. These are slopes, but slopes for which lines? What do these marginal effects actually look like?
Plotting these regression lines is tricky because we're no longer working with a single variable on the x-axis. Instead, we need to generate predicted values of the outcome across a range of one variable while holding all the other variables constant. This is exactly what we've been doing to get marginal effects, only now instead of getting slopes as the output, we want fitted values. Both marginaleffects and emmeans make this easy.
In the marginaleffects world, we can use predictions(). Instead of a dydx column for the slope, we have a predicted column for the fitted value from the model:
model_logit_fancy |>
predictions(newdata = datagrid(public_sector_corruption = c(20, 80),
region = regions_to_use))
## rowid type predicted std.error statistic p.value conf.low conf.high disclose_donations polyarchy log_gdp_percapita public_sector_corruption region
## 1 1 response 3.81e-01 1.51e-01 2.516 0.011875 1.49e-01 0.68404 FALSE 52.8 8.57 20 Western Europe and North America
## 2 2 response 3.97e-01 1.34e-01 2.960 0.003073 1.80e-01 0.66437 FALSE 52.8 8.57 20 Latin America and the Caribbean
## 3 3 response 6.58e-01 1.77e-01 3.710 0.000207 2.91e-01 0.90016 FALSE 52.8 8.57 20 Middle East and North Africa
## 4 4 response 7.69e-05 6.16e-05 1.248 0.212184 1.60e-05 0.00037 FALSE 52.8 8.57 80 Western Europe and North America
## 5 5 response 5.45e-02 8.55e-02 0.638 0.523582 2.23e-03 0.59798 FALSE 52.8 8.57 80 Latin America and the Caribbean
## 6 6 response 5.93e-04 8.12e-04 0.731 0.465014 4.05e-05 0.00862 FALSE 52.8 8.57 80 Middle East and North Africa
In the emmeans world, we can use emmeans():
model_logit_fancy |>
emmeans(~ public_sector_corruption + region, var = "public_sector_corruption",
at = list(public_sector_corruption = c(20, 80),
region = regions_to_use),
regrid = "response")
## public_sector_corruption region prob SE df asymp.LCL asymp.UCL
## 20 Western Europe and North America 0.381 0.2975 Inf -0.2021 0.964
## 80 Western Europe and North America 0.000 0.0005 Inf -0.0009 0.001
## 20 Latin America and the Caribbean 0.397 0.1827 Inf 0.0395 0.755
## 80 Latin America and the Caribbean 0.055 0.0776 Inf -0.0975 0.207
## 20 Middle East and North Africa 0.658 0.2505 Inf 0.1671 1.149
## 80 Middle East and North Africa 0.001 0.0024 Inf -0.0042 0.005
##
## Confidence level used: 0.95
The results from the two packages are identical because we’re using a data grid—in both cases we’re averaging before plugging stuff into the model.
Instead of setting corruption to 20 and 80, we’ll use a whole range of values so we can plot it.
logit_predictions <- model_logit_fancy |>
emmeans(~ public_sector_corruption + region, var = "public_sector_corruption",
at = list(public_sector_corruption = seq(0, 90, 1)),
regrid = "response") |>
as_tibble()
ggplot(logit_predictions, aes(x = public_sector_corruption, y = prob, color = region)) +
geom_line(size = 1) +
labs(x = "Public sector corruption", y = "Predicted probability of having\na campaign finance disclosure law", color = NULL) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c(clrs, "grey30")) +
theme_mfx() +
theme(legend.position = "bottom")
(Alternatively, you can use marginaleffects's built-in plot_cap() to make this plot with one line of code.)
That’s such a cool plot! Each region has a different shape of predicted probabilities across public sector corruption.
Earlier we calculated a bunch of instantaneous slopes when corruption was set to 20 and 80 in a few different regions, so let’s put those slopes and their tangent lines on the plot:
logit_slopes <- model_logit_fancy |>
emtrends(~ public_sector_corruption + region, var = "public_sector_corruption",
at = list(public_sector_corruption = c(20, 80),
region = regions_to_use),
regrid = "response", delta.var = 0.001) |>
as_tibble() |>
mutate(panel = glue("Corruption set to {public_sector_corruption}"))
slopes_to_plot <- logit_predictions |>
filter(public_sector_corruption %in% c(20, 80),
region %in% regions_to_use) |>
left_join(select(logit_slopes, public_sector_corruption, region, public_sector_corruption.trend, panel),
by = c("public_sector_corruption", "region")) |>
mutate(intercept = find_intercept(public_sector_corruption, prob, public_sector_corruption.trend)) |>
mutate(round_slope = label_number(accuracy = 0.001, style_negative = "minus")(public_sector_corruption.trend * 100),
nice_slope = glue("Slope: {round_slope} pct pts"))
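The find_intercept() helper used above comes from earlier in the post; the math behind it is just the point-slope form of a line. A tangent line through the point (x, y) with a given slope crosses the y-axis at y minus slope times x, so a minimal version (a hypothetical reimplementation, not necessarily the post's exact code) looks like this:

```r
# Hypothetical minimal version of find_intercept(): the y-intercept of a
# tangent line that passes through (x, y) with the given slope
find_intercept <- function(x, y, slope) {
  y - (slope * x)
}

# Tangent at (20, 0.381) with slope -0.0249 crosses the y-axis at 0.879
find_intercept(20, 0.381, -0.0249)
```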
ggplot(logit_predictions, aes(x = public_sector_corruption, y = prob, color = region)) +
geom_line(size = 1) +
geom_point(data = slopes_to_plot, size = 2, show.legend = FALSE) +
geom_abline(data = slopes_to_plot,
aes(slope = public_sector_corruption.trend, intercept = intercept, color = region),
size = 0.5, linetype = "21", show.legend = FALSE) +
geom_label_repel(data = slopes_to_plot, aes(label = nice_slope),
fontface = "bold", seed = 123, show.legend = FALSE,
size = 3, direction = "y") +
labs(x = "Public sector corruption",
y = "Predicted probability of having\na campaign finance disclosure law",
color = NULL) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("grey30", clrs)) +
facet_wrap(vars(panel)) +
theme_mfx() +
theme(legend.position = "bottom")
AHH this is delightful! This helps us understand and visualize all these marginal effects. Let’s interpret them:
MAJOR CAVEAT: None of these marginal effects are statistically significant, so there's a good chance they're actually zero, or positive, or even more negative. We can plot just these marginal slopes to show this:
ggplot(logit_slopes, aes(x = public_sector_corruption.trend * 100, y = region, color = region)) +
geom_vline(xintercept = 0, size = 0.5, linetype = "24", color = clrs[5]) +
geom_pointrange(aes(xmin = asymp.LCL * 100, xmax = asymp.UCL * 100)) +
scale_color_manual(values = c(clrs[4], clrs[1], clrs[2]), guide = "none") +
labs(x = "Marginal effect (percentage points)", y = NULL) +
facet_wrap(vars(panel), ncol = 1) +
theme_mfx()
Calculating marginal effects at representative values is useful and widespread—plugging different values into the model while holding others constant is the best way to see how all the different moving parts of a model work, especially when there are interactions, exponents, or nonlinear outcomes. We're using the full mixing panel here!
However, creating a hypothetical data or reference grid creates hypothetical observations that might never exist in real life. This was the main difference between the average marginal effect (AME) and the marginal effect at the mean (MEM) that we looked at earlier. Passing average covariate values into a model creates average predictions, but those averages might not reflect reality.
For example, we used this data grid to look at the effect of corruption on the probability of having a campaign finance disclosure law across different regions:
datagrid(model = model_logit_fancy,
public_sector_corruption = c(20, 80),
region = regions_to_use)
## disclose_donations polyarchy log_gdp_percapita public_sector_corruption region
## 1 FALSE 52.8 8.57 20 Western Europe and North America
## 2 FALSE 52.8 8.57 20 Latin America and the Caribbean
## 3 FALSE 52.8 8.57 20 Middle East and North Africa
## 4 FALSE 52.8 8.57 80 Western Europe and North America
## 5 FALSE 52.8 8.57 80 Latin America and the Caribbean
## 6 FALSE 52.8 8.57 80 Middle East and North Africa
Polyarchy (democracy) and GDP per capita here are set at their dataset-level means, but that's not how the world actually works. Levels of democracy and personal wealth vary a lot by region:
corruption |>
filter(region %in% regions_to_use) |>
group_by(region) |>
summarize(avg_polyarchy = mean(polyarchy),
avg_log_gdp_percapita = mean(log_gdp_percapita))
## # A tibble: 3 × 3
## region avg_polyarchy avg_log_gdp_percapita
## <fct> <dbl> <dbl>
## 1 Latin America and the Caribbean 62.9 8.77
## 2 Middle East and North Africa 27.4 9.01
## 3 Western Europe and North America 86.5 10.7
Western Europe is far more democratic (average polyarchy = 86.50) than the Middle East (average polyarchy = 27.44). But in our calculations for finding region-specific marginal effects, we've been using a polyarchy value of 52.79 for all the regions.
Fortunately we can do something neat to work with observed covariate values and thus create an AME-flavored marginal effect at representative values instead of the current MEM-flavored marginal effect at representative values. Here's the general process:
Instead of creating a data or reference grid, we create multiple copies of our original dataset. In each copy we change the columns that we want to set to specific values and we leave all the other columns at their original values. We then feed all the copies of the dataset into the model and generate a ton of fitted values, which we then collapse into average effects.
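The steps above can be sketched in base R with a toy model (synthetic data and made-up names, not the post's real data or workflow):

```r
# Copy-and-average sketch (toy model, synthetic data): stack full copies of
# the data with the focal column overwritten, predict, then average per copy
set.seed(1)
d <- data.frame(x = runif(100, 0, 100), z = rnorm(100, 50, 10))
d$y <- rbinom(100, 1, plogis(1 - 0.04 * d$x + 0.01 * d$z))
m <- glm(y ~ x + z, data = d, family = binomial())

# Two complete copies: x forced to 20 in one and 80 in the other; z keeps
# its actually-observed values in both copies
copies <- rbind(
  transform(d, x = 20, x_set = 20),
  transform(d, x = 80, x_set = 80)
)
copies$p_hat <- predict(m, newdata = copies, type = "response")

# Collapse the fitted values into one average prediction per setting of x
aggregate(p_hat ~ x_set, data = copies, FUN = mean)
```

Because every copy keeps the observed covariate values, averaging these predictions gives an AME-flavored answer rather than a prediction at artificial average covariates.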
That sounds really complex, but it's only a matter of adding one argument to marginaleffects::datagrid(). We'll take region out of datagrid() here so that we keep all the original regions—we'll take the average across those regions after the fact.
This new data grid has twice the number of rows that we have in the original data, since there are now two copies of the data stacked together: