# Visualizing the differences between Bayesian posterior predictions, linear predictions, and the expectation of posterior predictions

A guide to different types of Bayesian posterior distributions and the nuances of posterior_predict, posterior_epred, and posterior_linpred
r
tidyverse
ggplot
regression
bayes
brms
stan
Author
Published

Monday, September 26, 2022

You can download PDF, SVG, and PNG versions of the diagrams and cheat sheets in this post, as well as the original Adobe Illustrator and InDesign files, at the bottom of this post

Do whatever you want with them! They’re licensed under Creative Commons Attribution-ShareAlike (BY-SA 4.0).

I’ve been working with Bayesian models and the Stan-based brms ecosystem (tidybayes, ggdist, marginaleffects, and friends) for a few years now, and I’m currently finally working through formal materials on Bayesianism and running an independent readings class with a PhD student at GSU where we’re reading Richard McElreath’s Statistical Rethinking and Alicia Johnson, Miles Ott, and Mine Dogucu’s Bayes Rules!, both of which are fantastic books (check out my translation of their materials to tidyverse/brms here).

Something that has always plagued me about working with Bayesian posterior distributions, but that I’ve always waved off as too hard to think about, has been the differences between posterior predictions, the expectation of the posterior predictive distribution, and the posterior of the linear predictor (or posterior_predict(), posterior_epred(), and posterior_linpred() in the brms world). But reading these two books has forced me to finally figure it out.

So here’s an explanation of my mental model of the differences between these types of posterior distributions. It’s definitely not 100% correct, but it makes sense for me.

Let’s load some packages, load some data, and get started!

library(tidyverse)        # ggplot, dplyr, and friends
library(patchwork)        # Combine ggplot plots
library(ggtext)           # Fancier text in ggplot plots
library(scales)           # Labeling functions
library(brms)             # Bayesian modeling through Stan
library(tidybayes)        # Manipulate Stan objects in a tidy way
library(marginaleffects)  # Calculate marginal effects
library(modelr)           # For quick model grids
library(extraDistr)       # For dprop() beta distribution with mu/phi
library(distributional)   # For plotting distributions with ggdist
library(palmerpenguins)   # Penguins!
library(kableExtra)       # For nicer tables

# Make random things reproducible
set.seed(1234)

# Bayes stuff
# Use the cmdstanr backend for Stan because it's faster and more modern than
# the default rstan. You need to install the cmdstanr package first
# (https://mc-stan.org/cmdstanr/) and then run cmdstanr::install_cmdstan() to
# install cmdstan on your computer.
options(mc.cores = 4,  # Use 4 cores
brms.backend = "cmdstanr")
bayes_seed <- 1234

# Colors from MetBrewer
clrs <- MetBrewer::met.brewer("Java")

# Custom ggplot themes to make pretty plots
# Get Roboto Condensed at https://fonts.google.com/specimen/Roboto+Condensed
# Get Roboto Mono at https://fonts.google.com/specimen/Roboto+Mono
theme_pred <- function() {
theme_minimal(base_family = "Roboto Condensed") +
theme(panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "grey80", color = NA),
axis.title.x = element_text(hjust = 0),
axis.title.y = element_text(hjust = 0),
legend.title = element_text(face = "bold"))
}

theme_pred_dist <- function() {
theme_pred() +
theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())
}

theme_pred_range <- function() {
theme_pred() +
theme(plot.title = element_markdown(family = "Roboto Condensed", face = "plain"),
plot.subtitle = element_text(family = "Roboto Mono", size = rel(0.9), hjust = 0),
panel.grid.minor.y = element_blank())
}

update_geom_defaults("text", list(family = "Roboto Condensed", lineheight = 1))
# Add a couple new variables to the penguins data:
#  - is_gentoo: Indicator for whether or not the penguin is a Gentoo
#  - bill_ratio: The ratio of a penguin's bill depth (height) to its bill length
penguins <- penguins |>
drop_na(sex) |>
mutate(is_gentoo = species == "Gentoo") |>
mutate(bill_ratio = bill_depth_mm / bill_length_mm)

## Normal Gaussian model

First we’ll look at basic linear regression. Normal or Gaussian models are roughly equivalent to frequentist ordinary least squares (OLS) regression. We estimate an intercept and a slope and draw a line through the data. If we include multiple explanatory variables or predictors, we’ll have multiple slopes, or partial derivatives or marginal effects (see here for more about that). But to keep things as simple and basic and illustrative as possible, we’ll just use one explanatory variable here.

In this example, we’re interested in the relationship between penguin flipper length and penguin body mass. Do penguins with longer flippers weigh more? Here’s what the data looks like:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(size = 1, alpha = 0.7) +
geom_smooth(method = "lm", color = clrs[5], se = FALSE) +
scale_y_continuous(labels = label_comma()) +
coord_cartesian(ylim = c(2000, 6000)) +
labs(x = "Flipper length (mm)", y = "Body mass (g)") +
theme_pred()

It seems like there’s a pretty clear relationship between the two. As flipper length increases, body mass also increases.

We can create a more formal model for the distribution of body mass, conditional on different values of flipper length, like this:

\begin{aligned} \text{Body mass}_i &\sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta \ \text{Flipper length}_i \end{aligned}

Or more generally:

\begin{aligned} y_i &\sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{aligned}

This implies that body mass follows a normal (or Gaussian) distribution with some average ($$\mu$$) and some amount of spread ($$\sigma$$), and that the $$\mu$$ parameter is conditional on (or based on, or dependent on) flipper length.

Let’s run that model in Stan through brms (with all the default priors; in real life you’d want to set more official priors for the intercept $$\alpha$$, the coefficient $$\beta$$, and the overall model spread $$\sigma$$)

model_normal <- brm(
bf(body_mass_g ~ flipper_length_mm),
family = gaussian(),
data = penguins
)
## Start sampling

If we look at the model results, we can see the means of the posterior distributions of each of the model’s parameters ($$\alpha$$, $$\beta$$, and $$\sigma$$). The intercept ($$\alpha$$) is huge and negative because flipper length is far away from 0, so it’s pretty uninterpretable. The $$\beta$$ coefficient shows that a one-mm increase in flipper length is associated with a 50 gram increase in body mass. And the overall model standard deviation $$\sigma$$ shows that there’s roughly 400 grams of deviation around the mean body mass.

broom.mixed::tidy(model_normal) |>
bind_cols(parameter = c("α", "β", "σ")) |>
select(parameter, term, estimate, std.error, conf.low, conf.high)
## # A tibble: 3 × 6
##   parameter term              estimate std.error conf.low conf.high
##   <chr>     <chr>                <dbl>     <dbl>    <dbl>     <dbl>
## 1 α         (Intercept)        -5874.     311.    -6466.    -5257.
## 2 β         flipper_length_mm     50.2      1.54     47.1      53.1
## 3 σ         sd__Observation      394.      15.7     366.      426.

That table shows just the posterior means for each of these parameters, but these are technically all complete distributions. In this post we’re not interested in these actual values—we’re concerned with the outcome, or penguin weight here. (But you can see this post or this post or this post or this documentation for more about working with these coefficients and calculating marginal effects)

Going back to the formal model, so far we’ve looked at $$\alpha$$, $$\beta$$, and $$\sigma$$, but what about $$\mu$$ and the overall posterior distribution of the outcome $$y$$ (or $$\operatorname{Normal}(\mu_i, \sigma)$$)? This is where life gets a little trickier (and why this guide exists in the first place!). Both $$\mu$$ and the posterior for $$y$$ represent penguin body mass, but conceptually they’re different things. We’ll extract these different distributions with three different brms functions: posterior_predict(), posterior_epred(), and posterior_linpred() (the code uses predicted_draws(), epred_draws(), and linpred_draws(); these are tidybayes’s wrappers for the corresponding brms functions).

Note the newdata argument here. We have to feed a data frame of values to plug into to make these different posterior predictions. We could feed the original dataset with newdata = penguins, which would plug each row of the data into the model and generate 4000 posterior draws for it. Given that there are 333 rows in penguins data, using newdata = penguins would give us 333 × 4,000 = 1,332,000 rows. That’s a ton of data, and looking at it all together like that isn’t super useful unless we look at predictions across a range of possible predictors. We’ll do that later in this section and see the posterior predictions of weights across a range of flipper lengths. But here we’re just interested in the prediction of the outcome based on a single value of flipper lengths. We’ll use the average (200.967 mm), but it could easily be the median or whatever arbitrary number we want.

# Make a little dataset of just the average flipper length
penguins_avg_flipper <- penguins |>
summarize(flipper_length_mm = mean(flipper_length_mm))

# Extract different types of posteriors
normal_linpred <- model_normal |>
linpred_draws(newdata = penguins_avg_flipper)

normal_epred <- model_normal |>
epred_draws(newdata = penguins_avg_flipper)

normal_predicted <- model_normal |>
predicted_draws(newdata = penguins_avg_flipper,
seed = 12345)  # So that the manual results with rnorm() are the same later

These each show the posterior distribution of penguin weight, and each corresponds to a different part of the formal mathematical model with. We can explore these nuances if we look at these distributions’ means, medians, standard deviations, and overall shapes:

Code
summary_normal_linpred <- normal_linpred |>
ungroup() |>
summarize(across(.linpred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_epred <- normal_epred |>
ungroup() |>
summarize(across(.epred, lst(mean, sd, median), .names = "{.fn}"))

summary_normal_predicted <- normal_predicted |>
ungroup() |>
summarize(across(.prediction, lst(mean, sd, median), .names = "{.fn}"))

tribble(
~Function, ~Model element,
"<code>posterior_linpred()</code>", "\$$\\mu\$$ in the model",
"<code>posterior_epred()</code>", "\$$\\operatorname{E(y)}\$$ and \$$\\mu\$$ in the model",
"<code>posterior_predict()</code>", "Random draws from posterior \$$\\operatorname{Normal}(\\mu_i, \\sigma)\$$"
) |>
bind_cols(bind_rows(summary_normal_linpred, summary_normal_epred, summary_normal_predicted)) |>
kbl(escape = FALSE) |>
kable_styling()
Table 1: Normal posteriors
Function Model element mean sd median
posterior_linpred() $$\mu$$ in the model 4206 21.8 4207
posterior_epred() $$\operatorname{E(y)}$$ and $$\mu$$ in the model 4206 21.8 4207
posterior_predict() Random draws from posterior $$\operatorname{Normal}(\mu_i, \sigma)$$ 4207 386.7 4209
p1 <- ggplot(normal_linpred, aes(x = .linpred)) +
stat_halfeye(fill = clrs[3]) +
scale_x_continuous(labels = label_comma()) +
coord_cartesian(xlim = c(4100, 4300)) +
labs(x = "Body mass (g)", y = NULL,
title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>",
subtitle = "posterior_linpred(..., tibble(flipper_length_mm = 201))") +
theme_pred_dist() +
theme(plot.title = element_markdown())

p2 <- ggplot(normal_epred, aes(x = .epred)) +
stat_halfeye(fill = clrs[2]) +
scale_x_continuous(labels = label_comma()) +
coord_cartesian(xlim = c(4100, 4300)) +
labs(x = "Body mass (g)", y = NULL,
title = "**Expectation of the posterior** <span style='font-size: 14px;'>E[*y*] and *µ* in the model</span>",
subtitle = "posterior_epred(..., tibble(flipper_length_mm = 201))") +
theme_pred_dist()

p3 <- ggplot(normal_predicted, aes(x = .prediction)) +
stat_halfeye(fill = clrs[1]) +
scale_x_continuous(labels = label_comma()) +
coord_cartesian(xlim = c(2900, 5500)) +
labs(x = "Body mass (g)", y = NULL,
title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
subtitle = "posterior_predict(..., tibble(flipper_length_mm = 201))") +
theme_pred_dist()

(p1 / plot_spacer() / p2 / plot_spacer() / p3) +
plot_layout(heights = c(0.3, 0.05, 0.3, 0.05, 0.3))

The most obvious difference between these different posterior predictions is the range of predictions. For posterior_linpred() and posterior_epred(), the standard error is tiny and the range of plausible predicted values is really narrow. For posterior_predict(), the standard error is substantially bigger, and the corresponding range of predicted values is huge.

To understand why, let’s explore the math going on behind the scenes in these functions. Both posterior_linpred() and posterior_epred() correspond to the $$\mu$$ part of the model. They’re the average penguin weight as predicted by the linear model (hence linpred; linear predictor). We can see this if we plug a 201 mm flipper length into each row of the posterior and calculate mu by hand with $$\beta_0 + (\beta_1 \times 201)$$:

linpred_manual <- model_normal |>
mutate(mu = b_Intercept +
(b_flipper_length_mm * penguins_avg_flipper$flipper_length_mm)) linpred_manual ## # A tibble: 4,000 × 6 ## .chain .iteration .draw b_Intercept b_flipper_length_mm mu ## <int> <int> <int> <dbl> <dbl> <dbl> ## 1 1 1 1 -6152. 51.5 4204. ## 2 1 2 2 -5872 50.2 4221. ## 3 1 3 3 -6263. 52.1 4202. ## 4 1 4 4 -6066. 51.1 4213. ## 5 1 5 5 -5740. 49.4 4191. ## 6 1 6 6 -5678. 49.2 4213. ## 7 1 7 7 -6107. 51.1 4160. ## 8 1 8 8 -5422. 48.0 4235. ## 9 1 9 9 -6303. 52.1 4177. ## 10 1 10 10 -6193. 51.6 4184. ## # … with 3,990 more rows That mu column is identical to what we calculate with posterior_linpred(). Just to confirm, we can plot the two distributions: p1_manual <- linpred_manual |> ggplot(aes(x = mu)) + stat_halfeye(fill = colorspace::lighten(clrs[3], 0.5)) + scale_x_continuous(labels = label_comma()) + coord_cartesian(xlim = c(4100, 4300)) + labs(x = "Body mass (g)", y = NULL, title = "**Linear predictor** <span style='font-size: 14px;'>*µ* in the model</span>", subtitle = "b_Intercept + (b_flipper_length_mm * 201)") + theme_pred_dist() + theme(plot.title = element_markdown()) p1_manual | p1 Importantly, the distribution of the $$\mu$$ part of the model here does not incorporate information about $$\sigma$$. That’s why the distribution is so narrow. The results from posterior_predict(), on the other hand, correspond to the $$y$$ part of the model. Officially, they are draws from a random normal distribution using both the estimated $$\mu$$ and the estimated $$\sigma$$. These results contain the full uncertainty of the posterior distribution of penguin weight. To help with the intuition, we can do the same thing by hand when plugging in a 201 mm flipper length: set.seed(12345) # To get the same results as posterior_predict() from earlier postpred_manual <- model_normal |> spread_draws(b_Intercept, b_flipper_length_mm, sigma) |> mutate(mu = b_Intercept + (b_flipper_length_mm * penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

postpred_manual |>
select(.draw:y_new)
## # A tibble: 4,000 × 6
##    .draw b_Intercept b_flipper_length_mm sigma    mu y_new
##    <int>       <dbl>               <dbl> <dbl> <dbl> <dbl>
##  1     1      -6152.                51.5  384. 4204. 4429.
##  2     2      -5872                 50.2  401. 4221. 4506.
##  3     3      -6263.                52.1  390. 4202. 4159.
##  4     4      -6066.                51.1  409. 4213. 4027.
##  5     5      -5740.                49.4  362. 4191. 4411.
##  6     6      -5678.                49.2  393. 4213. 3499.
##  7     7      -6107.                51.1  417. 4160. 4423.
##  8     8      -5422.                48.0  351. 4235. 4138.
##  9     9      -6303.                52.1  426. 4177. 4055.
## 10    10      -6193.                51.6  426. 4184. 3793.
## # … with 3,990 more rows

That y_new column here is the $$y$$ part of the model and should have a lot more uncertainty than the mu column, which is just the $$\mu$$ part of the model. Notably, the y_new column is the same as what we get when using posterior predict(). We’ll plot the two distributions to confirm:

p3_manual <- postpred_manual |>
ggplot(aes(x = y_new)) +
stat_halfeye(fill = colorspace::lighten(clrs[1], 0.5)) +
scale_x_continuous(labels = label_comma()) +
coord_cartesian(xlim = c(2900, 5500)) +
labs(x = "Body mass (g)", y = NULL,
title = "**Posterior predictions** <span style='font-size: 14px;'>Random draws from posterior Normal(*µ*, *σ*)</span>",
subtitle = "rnorm(b_Intercept + (b_flipper_length_mm * 201), sigma)") +
theme_pred_dist() +
theme(plot.title = element_markdown())

p3_manual | p3

The results from posterior_predict() and posterior_linpred() have the same mean, but the full posterior predictions that incorporate the estimated $$\sigma$$ have a much wider range of plausible values.

The results from posterior_epred() are a little strange to understand, and in the case of normal/Gaussian regression (and many other types of regression models!), they’re identical to the linear predictor (posterior_linpred()). These are the posterior draws of the expected value or mean of the the posterior distribution, or $$E(y_i)$$ in the model. Behind the scenes, this is calculated by taking the average of each row’s posterior distribution and then taking the average of that.

Once again, a quick illustration can help. As before, we’ll manually plug a flipper length of 201 mm into the posterior estimates of the intercept and slope to calculate the $$\mu$$ part of the model. We’ll then use that $$\mu$$ along with the estimated $$\sigma$$ to in rnorm() to generate the posterior predictive distribution, or the $$y$$ part of the model. Finally, we’ll take the average of the y_new posterior predictive distribution to get the expectation of the posterior predictive distribution, or epred. It’s the same as what we get when using posterior_epred(); the only differences are because of randomness.

epred_manual <- model_normal |>
spread_draws(b_Intercept, b_flipper_length_mm, sigma) |>
mutate(mu = b_Intercept +
(b_flipper_length_mm *

## tl;dr: Diagrams and cheat sheets

Keeping track of which kinds of posterior predictions you’re working with, on which scales, and for which parameters, can be tricky, especially with more complex models with lots of moving parts. To make life easier, here are all the summary diagrams in one place:

### Complete cheat sheet

And here’s an even more detailed summary cheat sheet as a printable PDF: