# Manually generate predicted values for logistic regression with matrix multiplication in R

This is like basic stats stuff, but I can never remember how to do it—here’s how to use matrix multiplication to replicate the results of `predict()`
r
statistics
regression

In a project I’m working on, I need to generate predictions from a logistic regression model. That’s typically a really straightforward task—we can just use `predict()` to plug a dataset of values into a model, which will spit out predictions either on the (gross, uninterpretable) log odds scale or on the (nice, interpretable) percentage-point scale.1

• 1 And for pro-level predictions, use `predictions()` from {marginaleffects}.

• However, in this project I cannot use `predict()`—I’m working with a big matrix of posterior coefficient draws from a Bayesian model fit with raw Stan code, so there’s no special `predict()` function that will work. Instead, I need to use matrix multiplication and manually multiply a matrix of new data with a vector of slopes from the model.

I haven’t had to matrix multiply coefficients with data since my first PhD stats class back in 2012 and I’ve completely forgotten how.

I created this little guide as a reference for myself, and I figured it’d probably be useful for others, so up on the blog it goes (see the introduction to this post for more about my philosophy of public work). Image generated with Bing Image Creator with the prompt “regression matrix multiplication in a sketchy style”

## Example 1: Logistic regression model with an intercept and slopes

Here’s a basic regression model where we predict if a penguin is a Gentoo based on its bill length and body mass.

``````library(palmerpenguins)

penguins <- penguins |>
tidyr::drop_na(sex) |>
dplyr::mutate(is_gentoo = species == "Gentoo")

model <- glm(
is_gentoo ~ bill_length_mm + body_mass_g,
data = penguins,
)``````

We can generate predicted values across different three different values of bill length: 40 mm, 44 mm, and 48 mm, holding body mass constant at the average (4207 g):

``````data_to_plug_in <- expand.grid(
bill_length_mm = c(40, 44, 48),
body_mass_g = mean(penguins\$body_mass_g)
)
data_to_plug_in
##   bill_length_mm body_mass_g
## 1             40        4207
## 2             44        4207
## 3             48        4207``````

We can feed this little dataset into the model using `predict()`, which can generate predictions as log odds (`type = "link"`) or probabilities (`type = "response"`):

``````predict(model, newdata = data_to_plug_in, type = "link")
##      1      2      3
## -2.097 -1.741 -1.385
predict(model, newdata = data_to_plug_in, type = "response")
##      1      2      3
## 0.1094 0.1492 0.2002``````

Yay, nice and easy.

Here’s how to do the same thing with manual matrix multiplication:

``````# Get all the coefficients
(coefs <- coef(model))
##    (Intercept) bill_length_mm    body_mass_g
##     -32.401514       0.088906       0.006358

# Split intercept and slope coefficients into separate objects
(intercept <- coefs)
## (Intercept)
##       -32.4
(slopes <- coefs[-1])
## bill_length_mm    body_mass_g
##       0.088906       0.006358

# Convert the data frame of new data into a matrix
(data_to_plug_in_mat <- as.matrix(data_to_plug_in))
##      bill_length_mm body_mass_g
## [1,]             40        4207
## [2,]             44        4207
## [3,]             48        4207

# Matrix multiply the new data with the slope coefficients, then add the intercept
(log_odds <- as.numeric((data_to_plug_in_mat %*% slopes) + intercept))
##  -2.097 -1.741 -1.385

# Convert to probability scale
plogis(log_odds)
##  0.1094 0.1492 0.2002``````

The results are the same as `predict()`:

``````predict(model, newdata = data_to_plug_in, type = "link")
##      1      2      3
## -2.097 -1.741 -1.385
log_odds
##  -2.097 -1.741 -1.385

predict(model, newdata = data_to_plug_in, type = "response")
##      1      2      3
## 0.1094 0.1492 0.2002
plogis(log_odds)
##  0.1094 0.1492 0.2002``````

## Example 2: Logistic regression model with a categorical predictor and no intercept

This gets a little more complex when working with categorical predictors, especially if you’ve omitted the intercept term. For instance, in the data I’m working with, we have a model that looks something like this, with `0` added as a term to suppress the intercept and give separate coefficients for each of the levels of `sex`:

``````model_categorical <- glm(
is_gentoo ~ 0 + sex + bill_length_mm + body_mass_g,
data = penguins,
)
coef(model_categorical)
##      sexfemale        sexmale bill_length_mm    body_mass_g
##      -75.68237      -89.88199        0.03387        0.01831``````

When using `predict()`, we don’t have to do anything special with this intercept-free model. We can plug in a dataset with different variations of predictors:

``````data_to_plug_in_cat <- expand.grid(
sex = c("female", "male"),
bill_length_mm = c(40, 44, 48),
body_mass_g = mean(penguins\$body_mass_g)
)
data_to_plug_in_cat
##      sex bill_length_mm body_mass_g
## 1 female             40        4207
## 2   male             40        4207
## 3 female             44        4207
## 4   male             44        4207
## 5 female             48        4207
## 6   male             48        4207

predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
##       1       2       3       4       5       6
##   2.707 -11.493   2.842 -11.357   2.978 -11.222
predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
##         1         2         3         4         5         6
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05``````

If we want to do this manually, we have to create a matrix version of `data_to_plug_in_cat` that has separate columns for `sexfemale` and `sexmale`. We can’t just use `as.matrix(data_to_plug_in_cat)`, since that only has a single column for `sex` (and because that column contains text, it forces the rest of the matrix to be text, which makes it so we can’t do math with it anymore):

``````as.matrix(data_to_plug_in_cat)
##      sex      bill_length_mm body_mass_g
## [1,] "female" "40"           "4207"
## [2,] "male"   "40"           "4207"
## [3,] "female" "44"           "4207"
## [4,] "male"   "44"           "4207"
## [5,] "female" "48"           "4207"
## [6,] "male"   "48"           "4207"``````

Instead, we can use `model.matrix()` to create a design matrix—also called a dummy-encoded matrix2 or a one-hot encoded matrix—which makes columns of 0s and 1s for each of the levels of `sex`

• 2 Though we should probably quit using the word “dummy” because of its ableist connotations—see Google’s developer documentation style guide for alternatives.

• ``````data_to_plug_in_cat_mat <- model.matrix(
~ 0 + ., data = data_to_plug_in_cat
)
data_to_plug_in_cat_mat
##   sexfemale sexmale bill_length_mm body_mass_g
## 1         1       0             40        4207
## 2         0       1             40        4207
## 3         1       0             44        4207
## 4         0       1             44        4207
## 5         1       0             48        4207
## 6         0       1             48        4207
## attr(,"assign")
##  1 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")\$sex
##  "contr.treatment"``````

We can now do math with this matrix. Since we don’t have an intercept term, we don’t need to create separate objects for the slopes and intercepts and can matrix multiply the new data matrix with the model coefficients:

``````# Get all the coefficients
(coefs_cat <- coef(model_categorical))
##      sexfemale        sexmale bill_length_mm    body_mass_g
##      -75.68237      -89.88199        0.03387        0.01831

# Matrix multiply the newdata with the slope coefficients
(log_odds_cat <- as.numeric(data_to_plug_in_cat_mat %*% coefs_cat))
##    2.707 -11.493   2.842 -11.357   2.978 -11.222

# Convert to probability scale
plogis(log_odds_cat)
##  9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05``````

The results are the same!

``````predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
##       1       2       3       4       5       6
##   2.707 -11.493   2.842 -11.357   2.978 -11.222
log_odds_cat
##    2.707 -11.493   2.842 -11.357   2.978 -11.222

predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
##         1         2         3         4         5         6
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
plogis(log_odds_cat)
##  9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05``````

## Citation

BibTeX citation:
``````@online{heiss2023,
author = {Heiss, Andrew},
title = {Manually Generate Predicted Values for Logistic Regression
with Matrix Multiplication in {R}},
pages = {undefined},
date = {2023-08-15},
url = {https://www.andrewheiss.com/blog/2023/08/15/matrix-multiply-logit-predict},
doi = {10.59350/qba9a-b3561},
langid = {en}
}
``````