Manually generate predicted values for logistic regression with matrix multiplication in R

This is like basic stats stuff, but I can never remember how to do it—here’s how to use matrix multiplication to replicate the results of predict()
r
statistics
regression
Author
Published

Tuesday, August 15, 2023

Doi

In a project I’m working on, I need to generate predictions from a logistic regression model. That’s typically a really straightforward task—we can just use predict() to plug a dataset of values into a model, which will spit out predictions either on the (gross, uninterpretable) log odds scale or on the (nice, interpretable) percentage-point scale.1

  • 1 And for pro-level predictions, use predictions() from {marginaleffects}.

  • However, in this project I cannot use predict()—I’m working with a big matrix of posterior coefficient draws from a Bayesian model fit with raw Stan code, so there’s no special predict() function that will work. Instead, I need to use matrix multiplication and manually multiply a matrix of new data with a vector of slopes from the model.

    I haven’t had to matrix multiply coefficients with data since my first PhD stats class back in 2012 and I’ve completely forgotten how.

    I created this little guide as a reference for myself, and I figured it’d probably be useful for others, so up on the blog it goes (see the introduction to this post for more about my philosophy of public work).

    Image generated with Bing Image Creator with the prompt “regression matrix multiplication in a sketchy style”

    Example 1: Logistic regression model with an intercept and slopes

    Here’s a basic regression model where we predict if a penguin is a Gentoo based on its bill length and body mass.

    library(palmerpenguins)
    
    penguins <- penguins |> 
      tidyr::drop_na(sex) |> 
      dplyr::mutate(is_gentoo = species == "Gentoo")
    
    model <- glm(
      is_gentoo ~ bill_length_mm + body_mass_g,
      data = penguins,
      family = binomial(link = "logit")
    )

    We can generate predicted values across different three different values of bill length: 40 mm, 44 mm, and 48 mm, holding body mass constant at the average (4207 g):

    data_to_plug_in <- expand.grid(
      bill_length_mm = c(40, 44, 48),
      body_mass_g = mean(penguins$body_mass_g)
    )
    data_to_plug_in
    ##   bill_length_mm body_mass_g
    ## 1             40        4207
    ## 2             44        4207
    ## 3             48        4207

    We can feed this little dataset into the model using predict(), which can generate predictions as log odds (type = "link") or probabilities (type = "response"):

    predict(model, newdata = data_to_plug_in, type = "link")
    ##      1      2      3 
    ## -2.097 -1.741 -1.385
    predict(model, newdata = data_to_plug_in, type = "response")
    ##      1      2      3 
    ## 0.1094 0.1492 0.2002

    Yay, nice and easy.

    Here’s how to do the same thing with manual matrix multiplication:

    # Get all the coefficients
    (coefs <- coef(model))
    ##    (Intercept) bill_length_mm    body_mass_g 
    ##     -32.401514       0.088906       0.006358
    
    # Split intercept and slope coefficients into separate objects
    (intercept <- coefs[1])
    ## (Intercept) 
    ##       -32.4
    (slopes <- coefs[-1])
    ## bill_length_mm    body_mass_g 
    ##       0.088906       0.006358
    
    # Convert the data frame of new data into a matrix
    (data_to_plug_in_mat <- as.matrix(data_to_plug_in))
    ##      bill_length_mm body_mass_g
    ## [1,]             40        4207
    ## [2,]             44        4207
    ## [3,]             48        4207
    
    # Matrix multiply the new data with the slope coefficients, then add the intercept
    (log_odds <- as.numeric((data_to_plug_in_mat %*% slopes) + intercept))
    ## [1] -2.097 -1.741 -1.385
    
    # Convert to probability scale
    plogis(log_odds)
    ## [1] 0.1094 0.1492 0.2002

    The results are the same as predict():

    predict(model, newdata = data_to_plug_in, type = "link")
    ##      1      2      3 
    ## -2.097 -1.741 -1.385
    log_odds
    ## [1] -2.097 -1.741 -1.385
    
    predict(model, newdata = data_to_plug_in, type = "response")
    ##      1      2      3 
    ## 0.1094 0.1492 0.2002
    plogis(log_odds)
    ## [1] 0.1094 0.1492 0.2002

    Example 2: Logistic regression model with a categorical predictor and no intercept

    This gets a little more complex when working with categorical predictors, especially if you’ve omitted the intercept term. For instance, in the data I’m working with, we have a model that looks something like this, with 0 added as a term to suppress the intercept and give separate coefficients for each of the levels of sex:

    model_categorical <- glm(
      is_gentoo ~ 0 + sex + bill_length_mm + body_mass_g,
      data = penguins,
      family = binomial(link = "logit")
    )
    coef(model_categorical)
    ##      sexfemale        sexmale bill_length_mm    body_mass_g 
    ##      -75.68237      -89.88199        0.03387        0.01831

    When using predict(), we don’t have to do anything special with this intercept-free model. We can plug in a dataset with different variations of predictors:

    data_to_plug_in_cat <- expand.grid(
      sex = c("female", "male"),
      bill_length_mm = c(40, 44, 48),
      body_mass_g = mean(penguins$body_mass_g)
    )
    data_to_plug_in_cat
    ##      sex bill_length_mm body_mass_g
    ## 1 female             40        4207
    ## 2   male             40        4207
    ## 3 female             44        4207
    ## 4   male             44        4207
    ## 5 female             48        4207
    ## 6   male             48        4207
    
    predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
    ##       1       2       3       4       5       6 
    ##   2.707 -11.493   2.842 -11.357   2.978 -11.222
    predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
    ##         1         2         3         4         5         6 
    ## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

    If we want to do this manually, we have to create a matrix version of data_to_plug_in_cat that has separate columns for sexfemale and sexmale. We can’t just use as.matrix(data_to_plug_in_cat), since that only has a single column for sex (and because that column contains text, it forces the rest of the matrix to be text, which makes it so we can’t do math with it anymore):

    as.matrix(data_to_plug_in_cat)
    ##      sex      bill_length_mm body_mass_g
    ## [1,] "female" "40"           "4207"     
    ## [2,] "male"   "40"           "4207"     
    ## [3,] "female" "44"           "4207"     
    ## [4,] "male"   "44"           "4207"     
    ## [5,] "female" "48"           "4207"     
    ## [6,] "male"   "48"           "4207"

    Instead, we can use model.matrix() to create a design matrix—also called a dummy-encoded matrix2 or a one-hot encoded matrix—which makes columns of 0s and 1s for each of the levels of sex

  • 2 Though we should probably quit using the word “dummy” because of its ableist connotations—see Google’s developer documentation style guide for alternatives.

  • data_to_plug_in_cat_mat <- model.matrix(
      ~ 0 + ., data = data_to_plug_in_cat
    )
    data_to_plug_in_cat_mat
    ##   sexfemale sexmale bill_length_mm body_mass_g
    ## 1         1       0             40        4207
    ## 2         0       1             40        4207
    ## 3         1       0             44        4207
    ## 4         0       1             44        4207
    ## 5         1       0             48        4207
    ## 6         0       1             48        4207
    ## attr(,"assign")
    ## [1] 1 1 2 3
    ## attr(,"contrasts")
    ## attr(,"contrasts")$sex
    ## [1] "contr.treatment"

    We can now do math with this matrix. Since we don’t have an intercept term, we don’t need to create separate objects for the slopes and intercepts and can matrix multiply the new data matrix with the model coefficients:

    # Get all the coefficients
    (coefs_cat <- coef(model_categorical))
    ##      sexfemale        sexmale bill_length_mm    body_mass_g 
    ##      -75.68237      -89.88199        0.03387        0.01831
    
    # Matrix multiply the newdata with the slope coefficients
    (log_odds_cat <- as.numeric(data_to_plug_in_cat_mat %*% coefs_cat))
    ## [1]   2.707 -11.493   2.842 -11.357   2.978 -11.222
    
    # Convert to probability scale
    plogis(log_odds_cat)
    ## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

    The results are the same!

    predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
    ##       1       2       3       4       5       6 
    ##   2.707 -11.493   2.842 -11.357   2.978 -11.222
    log_odds_cat
    ## [1]   2.707 -11.493   2.842 -11.357   2.978 -11.222
    
    predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
    ##         1         2         3         4         5         6 
    ## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
    plogis(log_odds_cat)
    ## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05

    Citation

    BibTeX citation:
    @online{heiss2023,
      author = {Heiss, Andrew},
      title = {Manually Generate Predicted Values for Logistic Regression
        with Matrix Multiplication in {R}},
      pages = {undefined},
      date = {2023-08-15},
      url = {https://www.andrewheiss.com/blog/2023/08/15/matrix-multiply-logit-predict},
      doi = {10.59350/qba9a-b3561},
      langid = {en}
    }
    
    For attribution, please cite this work as:
    Heiss, Andrew. 2023. “Manually Generate Predicted Values for Logistic Regression with Matrix Multiplication in R.” August 15, 2023. https://doi.org/10.59350/qba9a-b3561.