How to use natural and base 10 log scales in ggplot2

Use the {scales} R package to automatically adjust and format x- and y-axis scales to use log base 10 and natural log values
r
tidyverse
ggplot
data visualization
Author
Published

Thursday, December 8, 2022

Doi

I always forget how to deal with logged values in ggplot—particularly things that use the natural log. The {scales} package was invented in part to allow users to adjust axes and scales in plots, including adjusting axes to account for logged values, but there have been some new developments in {scales} that have made existing answers (like this one on StackOverflow) somewhat obsolete (e.g. the trans_breaks() and trans_format() functions used there are superceded and deprecated).

So here’s a quick overview of how to use 2022-era {scales} to adjust axis breaks and labels to use both base 10 logs and natural logs. I’ll use data from the Gapminder project, since it has a nice exponentially-distributed measure of GDP per capita.

library(tidyverse)
library(scales)
library(gapminder)
library(patchwork)

# Just look at one year
gapminder_2007 <- gapminder |>
  filter(year == 2007)

theme_set(theme_bw() + theme(plot.title = element_text(face = "bold")))

Original unlogged values

The distribution of GDP per capita is heavily skewed, with most countries reporting less than $10,000. As a result, the scatterplot makes an upside-down L shape. Try sticking a regression line on that and you’ll get in trouble.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita",
       subtitle = "Original non-logged values")

Scatterplot of GDP per capita and life expectancy. GDP per capita is exponentially distributed so it is heavily skewed with most observations under $10,000. The resulting shape of the plot is not linear.

Log base 10

ggplot comes with a built-in scale_x_log10() to transform the x-axis into logged values. It will automatically create pretty, logical breaks based on the data. Here, the breaks automatically go from 300 → 1000 → 3000 → 10000, and so on:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10()") +
  theme(panel.grid.minor = element_blank())

The x-axis now shows GDP per capita scaled to log base 10, with axis breaks at 300, 1000, 3000, 100000, and 30000. The relationship is much more linear now.

If we want to be mathy about the labels, we can format them as base 10 exponents using label_log():

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = label_log(digits = 2)) +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10() with exponentiated labels") +
  theme(panel.grid.minor = element_blank())

The x-axis shows logged values, but instead of displaying dollar amounts like 300, 1000, etc., it displays exponents like \(10^{2.5}\) and \(10^3\).

What if we don’t want the default 300, 1000, 3000, etc. breaks? In the interactive plot at gapminder.org, the breaks start at 500 and double after that: 500, 1000, 2000, 4000, 8000, etc. We can control our axis breaks by feeding a list of numbers to scale_x_log10() with the breaks argument. Instead of typing out every possible break, we can generate a list of numbers starting at 500 and then doubling (\(500 \times 2^0\), \(500 \times 2^1\), \(500 \times 2^2\), and so on):

500 * 2^seq(0, 8, by = 1)
## [1]    500   1000   2000   4000   8000  16000  32000  64000 128000

For bonus fun, we’ll format the breaks as dollars and use the new-as-of-{scales}-1.2.0 cut_short_scale() to shorten the values:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(breaks = 500 * 2^seq(0, 9, by = 1),
                labels = label_dollar(scale_cut = cut_short_scale())) +
  guides(color = "none") +
  labs(title = "GDP per capita, log base 10",
       subtitle = "scale_x_log10() + more logical breaks") +
  theme(panel.grid.minor = element_blank())

The x-axis shows logged values, but instead using the default automatic breaks at 300, 1000, etc., it has breaks at 500, 1000, 2000, 4000, etc.

Log base \(e\), or the natural log

Log base 10 makes sense for visualizing things. Seeing the jumps from $500 → $1000 → $2000 is generally easy for people to understand (especially in today’s world of exponentially growing COVID cases). When working with logged values for statistical modeling, analysts prefer to use the natural log, or log base \(e\) instead.

What the heck is \(e\)?

Here are a bunch of helpful resources explaining what \(e\) and the natural log are and why analysts use them all the time:

The default logging function in R, log(), calculates the natural log (you have to use log10() or log(base = 10) to get base 10 logs).

Plotting natural logged values is a little trickier than base 10 values, since ggplot doesn’t have anything like scale_x_log_e(). But it’s still doable.

First, we can log the value on our own and just use the default scale_x_continuous() for labeling:

ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually")

The x-axis now shows GDP per capita scaled to log base \(e\), or the natural log, with axis breaks at 6, 7, 8, 9, 10, and 11. The relationship still linear, just like log base 10, but the values are less interpretable. The values on the x-axis were logged before being fed to ggplot.

Those 6, 7, 8, etc. breaks in the x-axis represent the power \(e\) is raised to, like \(e^6\) and \(e^8\). We can format these labels as exponents to make that clearer:

ggplot(gapminder_2007, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(labels = label_math(e^.x)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "GDP per capita logged manually, exponentiated labels")

The x-axis labels show natural log values as exponents for \(e\): \(e^6\), \(e^7\), and so on. They’re still tricky to interpret, but now it shows that they’re at least based on \(e\) instead of being actual values like 6. The values on the x-axis were logged before being fed to ggplot.

To get these labels, we have to pre-log GDP per capita. We didn’t need to pre-log the varialb when using scale_x_log10(), since that logs things for us. We can have the scale_x_*() function handle the natural logging for us too by specifying trans = log_trans():

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans()) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans()")

The values on the x-axis are now logged by ggplot. The x-axis labels are on the dollar scale instead of the log scale. This makes it a little easier to interpret, but the numbers are gross: 1096.633, 8103.084, and 59874.142, or \(e^7\), \(e^9\), and \(e^{11}\)

Everything is logged as expected, but those labels are gross—they’re \(e^7\), \(e^9\), and \(e^{11}\), but on the dollar scale:

exp(c(7, 9, 11))
## [1]  1096.633  8103.084 59874.142

We can format these breaks as \(e\)-based exponents instead with label_math() (with the format = log argument to make the formatting function log the values first):

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     # This breaks_log() thing happens behind the scenes and
                     # isn't strictly necessary here
                     # breaks = breaks_log(base = exp(1)),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels")

The x-axis now shows GDP per capita scaled to log base \(e\), or the natural log, with automatic axis breaks at 7, 9, and 11. The values on the x-axis are logged automatically with trans = log_trans().

If we want more breaks than 7, 9, 11, we can feed the scaling function a list of exponentiated breaks:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = log_trans(),
                     breaks = exp(6:11),
                     labels = label_math(e^.x, format = log)) +
  guides(color = "none") +
  labs(title = "GDP per capita, natural log (base e)",
       subtitle = "trans = log_trans(), exponentiated labels, custom breaks")

The x-axis now shows GDP per capita scaled to log base \(e\), or the natural log, with axis breaks at 6, 7, 8, 9, 10, and 11. The values on the x-axis are logged automatically with trans = log_trans().

Citation

BibTeX citation:
@online{heiss2022,
  author = {Heiss, Andrew},
  title = {How to Use Natural and Base 10 Log Scales in Ggplot2},
  pages = {undefined},
  date = {2022-12-08},
  url = {https://www.andrewheiss.com/blog/2022/12/08/log10-natural-log-scales-ggplot},
  doi = {10.59350/b4gjd-50c81},
  langid = {en}
}
For attribution, please cite this work as:
Heiss, Andrew. 2022. “How to Use Natural and Base 10 Log Scales in Ggplot2.” December 8, 2022. https://doi.org/10.59350/b4gjd-50c81.