Manually generate predicted values for logistic regression with matrix multiplication in R

In a project I’m working on, I need to generate predictions from a logistic regression model. That’s typically a really straightforward task—we can just use predict() to plug a dataset of values into a model, which will spit out predictions either on the (gross, uninterpretable) log odds scale or on the (nice, interpretable) percentage-point scale.¹

¹ And for pro-level predictions, use predictions() from {marginaleffects}.

However, in this project I cannot use predict()—I’m working with a big matrix of posterior coefficient draws from a Bayesian model fit with raw Stan code, so there’s no special predict() function that will work. Instead, I need to use matrix multiplication and manually multiply a matrix of new data with a vector of slopes from the model.

I haven’t had to matrix multiply coefficients with data since my first PhD stats class back in 2012 and I’ve completely forgotten how.

I created this little guide as a reference for myself, and I figured it’d probably be useful for others, so up on the blog it goes (see the introduction to this post for more about my philosophy of public work).

Example 1: Logistic regression model with an intercept and slopes

Here’s a basic regression model where we predict if a penguin is a Gentoo based on its bill length and body mass.

library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw

penguins <- penguins |> 
  tidyr::drop_na(sex) |> 
  dplyr::mutate(is_gentoo = species == "Gentoo")

model <- glm(
  is_gentoo ~ bill_length_mm + body_mass_g,
  data = penguins,
  family = binomial(link = "logit")
)

We can generate predicted values across different three different values of bill length: 40 mm, 44 mm, and 48 mm, holding body mass constant at the average (4207 g):

data_to_plug_in <- expand.grid(
  bill_length_mm = c(40, 44, 48),
  body_mass_g = mean(penguins$body_mass_g)
)
data_to_plug_in
##   bill_length_mm body_mass_g
## 1             40        4207
## 2             44        4207
## 3             48        4207

We can feed this little dataset into the model using predict(), which can generate predictions as log odds (type = "link") or probabilities (type = "response"):

predict(model, newdata = data_to_plug_in, type = "link")
##      1      2      3 
## -2.097 -1.741 -1.385
predict(model, newdata = data_to_plug_in, type = "response")
##      1      2      3 
## 0.1094 0.1492 0.2002

Yay, nice and easy.

Here’s how to do the same thing with manual matrix multiplication:

# Get all the coefficients
(coefs <- coef(model))
##    (Intercept) bill_length_mm    body_mass_g 
##     -32.401514       0.088906       0.006358

# Split intercept and slope coefficients into separate objects
(intercept <- coefs[1])
## (Intercept) 
##       -32.4
(slopes <- coefs[-1])
## bill_length_mm    body_mass_g 
##       0.088906       0.006358

# Convert the data frame of new data into a matrix
(data_to_plug_in_mat <- as.matrix(data_to_plug_in))
##      bill_length_mm body_mass_g
## [1,]             40        4207
## [2,]             44        4207
## [3,]             48        4207

# Matrix multiply the new data with the slope coefficients, then add the intercept
(log_odds <- as.numeric((data_to_plug_in_mat %*% slopes) + intercept))
## [1] -2.097 -1.741 -1.385

# Convert to probability scale
plogis(log_odds)
## [1] 0.1094 0.1492 0.2002

The results are the same as predict():

predict(model, newdata = data_to_plug_in, type = "link")
##      1      2      3 
## -2.097 -1.741 -1.385
log_odds
## [1] -2.097 -1.741 -1.385

predict(model, newdata = data_to_plug_in, type = "response")
##      1      2      3 
## 0.1094 0.1492 0.2002
plogis(log_odds)
## [1] 0.1094 0.1492 0.2002

Example 2: Logistic regression model with a categorical predictor and no intercept

This gets a little more complex when working with categorical predictors, especially if you’ve omitted the intercept term. For instance, in the data I’m working with, we have a model that looks something like this, with 0 added as a term to suppress the intercept and give separate coefficients for each of the levels of sex:

model_categorical <- glm(
  is_gentoo ~ 0 + sex + bill_length_mm + body_mass_g,
  data = penguins,
  family = binomial(link = "logit")
)
coef(model_categorical)
##      sexfemale        sexmale bill_length_mm    body_mass_g 
##      -75.68237      -89.88199        0.03387        0.01831

When using predict(), we don’t have to do anything special with this intercept-free model. We can plug in a dataset with different variations of predictors:

data_to_plug_in_cat <- expand.grid(
  sex = c("female", "male"),
  bill_length_mm = c(40, 44, 48),
  body_mass_g = mean(penguins$body_mass_g)
)
data_to_plug_in_cat
##      sex bill_length_mm body_mass_g
## 1 female             40        4207
## 2   male             40        4207
## 3 female             44        4207
## 4   male             44        4207
## 5 female             48        4207
## 6   male             48        4207

predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
##       1       2       3       4       5       6 
##   2.707 -11.493   2.842 -11.357   2.978 -11.222
predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
##         1         2         3         4         5         6 
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

If we want to do this manually, we have to create a matrix version of data_to_plug_in_cat that has separate columns for sexfemale and sexmale. We can’t just use as.matrix(data_to_plug_in_cat), since that only has a single column for sex (and because that column contains text, it forces the rest of the matrix to be text, which makes it so we can’t do math with it anymore):

as.matrix(data_to_plug_in_cat)
##      sex      bill_length_mm body_mass_g
## [1,] "female" "40"           "4207"     
## [2,] "male"   "40"           "4207"     
## [3,] "female" "44"           "4207"     
## [4,] "male"   "44"           "4207"     
## [5,] "female" "48"           "4207"     
## [6,] "male"   "48"           "4207"

Instead, we can use model.matrix() to create a design matrix—also called a dummy-encoded matrix² or a one-hot encoded matrix—which makes columns of 0s and 1s for each of the levels of sex

² Though we should probably quit using the word “dummy” because of its ableist connotations—see Google’s developer documentation style guide for alternatives.

data_to_plug_in_cat_mat <- model.matrix(
  ~ 0 + ., data = data_to_plug_in_cat
)
data_to_plug_in_cat_mat
##   sexfemale sexmale bill_length_mm body_mass_g
## 1         1       0             40        4207
## 2         0       1             40        4207
## 3         1       0             44        4207
## 4         0       1             44        4207
## 5         1       0             48        4207
## 6         0       1             48        4207
## attr(,"assign")
## [1] 1 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$sex
## [1] "contr.treatment"

We can now do math with this matrix. Since we don’t have an intercept term, we don’t need to create separate objects for the slopes and intercepts and can matrix multiply the new data matrix with the model coefficients:

# Get all the coefficients
(coefs_cat <- coef(model_categorical))
##      sexfemale        sexmale bill_length_mm    body_mass_g 
##      -75.68237      -89.88199        0.03387        0.01831

# Matrix multiply the newdata with the slope coefficients
(log_odds_cat <- as.numeric(data_to_plug_in_cat_mat %*% coefs_cat))
## [1]   2.707 -11.493   2.842 -11.357   2.978 -11.222

# Convert to probability scale
plogis(log_odds_cat)
## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

The results are the same!

predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
##       1       2       3       4       5       6 
##   2.707 -11.493   2.842 -11.357   2.978 -11.222
log_odds_cat
## [1]   2.707 -11.493   2.842 -11.357   2.978 -11.222

predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
##         1         2         3         4         5         6 
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
plogis(log_odds_cat)
## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

Citation

BibTeX citation:

@online{heiss2023,
  author = {Heiss, Andrew},
  title = {Manually Generate Predicted Values for Logistic Regression
    with Matrix Multiplication in {R}},
  date = {2023-08-15},
  url = {https://www.andrewheiss.com/blog/2023/08/15/matrix-multiply-logit-predict/},
  doi = {10.59350/qba9a-b3561},
  langid = {en}
}

For attribution, please cite this work as:

Heiss, Andrew. 2023. “Manually Generate Predicted Values for Logistic Regression with Matrix Multiplication in R.” August 15, 2023. https://doi.org/10.59350/qba9a-b3561.