Change in Predicted Probability for a Categorical Explanatory Variable

Comparing results when we change several different variables at once gives us a sense of how predicted probabilities vary over different hypothetical, but realistic enough, scenarios.

Often, though, we are interested in a more apples-to-apples comparison in terms of what our model says about the difference in predicted probabilities when only one variable changes, holding the others constant.

In the linear probability model, this was easy: the change in predicted probability when one explanatory variable changes does not depend on the values of the other explanatory variables (unless there is an explicit interaction term in the model).

In the logit model, however, the change in predicted probability does depend on the values of the other explanatory variables.
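
This dependence is built into the model's functional form: predicted probabilities come from applying the inverse logit (plogis() in base R) to the linear predictor, so the same shift on the log-odds scale produces different probability changes depending on the starting point. A quick sketch with hypothetical numbers:

```r
b <- 1.5  # hypothetical log-odds coefficient for a binary predictor

# Change in predicted probability when the predictor flips 0 -> 1,
# evaluated at two different baseline linear predictors:
plogis(-4 + b) - plogis(-4)        # baseline near 0: about .06
plogis(-0.75 + b) - plogis(-0.75)  # baseline near .5: about .36
```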

Here we will consider the change in probability for a categorical explanatory variable.

Comparing predictions that differ on only one categorical variable

Here, we revisit the example we considered earlier on police stops in San Francisco. Before, we compared the predicted probabilities for a 19-year-old Black man stopped at 1am to a 65-year-old White woman stopped at 8am.

Instead, though, we may want to ask what the probability would be for a White man with the same values on the other variables.

We fit the same logit model as before:

Expand to see code for creating variables
library(tidyverse)
library(haven)
library(tulaverse)
library(marginaleffects)
options(marginaleffects_print_omit = c("s.value")) # do not print distracting s value

df <- read_dta("../dta/sfpd_mv.dta")

# R: need to convert variables that imported as labeled numeric to factor variables
df <- df %>%
  mutate(sex = as_factor(male)) %>%
  mutate(race = as_factor(race)) %>%
  mutate(tod = as_factor(timeofday)) %>%
  mutate(race = relevel(race, ref="white")) %>%
  mutate(tod = relevel(tod, ref = "9am-12pm"))

# mean impute missing values for age

df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
model <- glm(search_vehicle ~ sex + race + tod + age + age_missing, 
             data = df, family = "binomial")
tula(model)
Family: binomial / Link: logit
AIC            = 185499.309                             Number of obs   = 563089
BIC            = 185667.927                             McFadden R-sq   = 0.1169
Log likelihood = -92734.655                             Nagelkerke R-sq = 0.1371
────────────────────────────────────────────────────────────────────────────────
search_vehicle │      Coef  Std. Err.          z     P>|z|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
sex            │                                                                
  male         │     .7066     .01719       41.1    <.0001      .6729      .7403
race           │                                                                
  asian/paci~r │    -.4847     .02758     -17.57    <.0001     -.5387     -.4306
  black        │     1.594     .01694      94.12    <.0001      1.561      1.627
  hispanic     │     1.136     .01881      60.37    <.0001      1.099      1.173
  other        │    -.1643     .02696     -6.092    <.0001     -.2171     -.1114
tod            │                                                                
  12am-3am     │     .7363     .02814      26.16    <.0001      .6812      .7915
  3am-6am      │     .7169     .04061      17.65    <.0001      .6373      .7965
  6am-9am      │   -.08343     .03515     -2.374     .0176     -.1523    -.01454
  12pm-3pm     │     .3093     .02719      11.38    <.0001       .256      .3626
  3pm-6pm      │     .4024     .02616      15.38    <.0001      .3511      .4537
  6pm-9pm      │     .4957     .02653      18.68    <.0001      .4437      .5477
  9pm-12am     │     .5703     .02585      22.07    <.0001      .5197       .621
age            │   -.03215   .0005883     -54.64    <.0001     -.0333      -.031
age_missing    │    -.1768     .02967     -5.961    <.0001      -.235     -.1187
(Intercept)    │    -3.309     .03419      -96.8    <.0001     -3.376     -3.242
────────────────────────────────────────────────────────────────────────────────

As before, we compute our predictions using the marginaleffects package.

We can compute predicted probabilities as one variable changes by using the predictions() function. We use the variables argument to specify the variable we want to change and the newdata argument to set the values at which we hold the other variables constant.

predictions(model,
            variables = list(race = c("black", "white")),
            newdata = datagrid(
              sex = "male",
              tod = "12am-3am",
              age = 19,
              age_missing = 0
            )
          )

  sex      tod age age_missing Estimate Pr(>|z|)  2.5 % 97.5 %
 male 12am-3am  19           0   0.0775   <0.001 0.0743 0.0808
 male 12am-3am  19           0   0.2925   <0.001 0.2839 0.3014

Type: invlink(link)

p.test <- predictions(model,
            variables = list(race = c("black", "white")),
            newdata = datagrid(
              sex = "male",
              tod = "12am-3am",
              age = 19,
              age_missing = 0
            )
          )

as_tibble(p.test %>% select(race, estimate))
# A tibble: 2 × 2
  race  estimate
  <fct>    <dbl>
1 white   0.0775
2 black   0.293 
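
As a sanity check, these predictions can be reproduced by hand: build the linear predictor from the (rounded) coefficients in the table above and apply the inverse logit, plogis() in base R.

```r
# White 19-year-old man stopped between 12am and 3am:
# intercept + male + 12am-3am + age slope * 19
lp_white <- -3.309 + .7066 + .7363 - .03215 * 19
plogis(lp_white)          # about .0775

# Same profile, but Black: add the black coefficient
plogis(lp_white + 1.594)  # about .2925
```

Any small discrepancies from the marginaleffects output reflect rounding in the printed coefficients.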

Printing marginaleffects results

In making these pages, I have found the output from the marginaleffects package especially frustrating to work with.

In this example, I am using as_tibble() to print the data as a tibble rather than its default “pretty printing” and select() to print only those columns of the tibble that I want. I do this because the “pretty printing” output otherwise wouldn’t include the race column, and if I printed the whole tibble it would include a bunch of other columns that are not of interest here.


Instead of just getting predictions for two categories of race/ethnicity, we can get predictions for all categories if we use unique in our variables argument:

p.race <- predictions(model,
                      variables = list(race = unique),
                      newdata = datagrid(
                        sex = "male",
                        tod = "12am-3am",
                        age = 19,
                        age_missing = 0
                      )
                      )

as_tibble(p.race %>% select(race, estimate))
# A tibble: 5 × 2
  race                   estimate
  <fct>                     <dbl>
1 white                    0.0775
2 asian/pacific islander   0.0492
3 black                    0.293 
4 hispanic                 0.207 
5 other                    0.0665

Meanwhile, the comparisons() function allows us to compute differences from the reference category directly, rather than obtaining separate predictions and taking differences ourselves.

In our case, we can use comparisons() to obtain the difference in predicted probabilities for the different race/ethnic categories versus the reference category (Whites):

comparisons(model, variables = "race",
            newdata = datagrid(
              sex = "male",
              tod = "12am-3am",
              age = 19,
              age_missing = 0
            )) %>%
  select(contrast, estimate, p.value, conf.low, conf.high)

                       Contrast Estimate Pr(>|z|)   2.5 %   97.5 %
 asian/pacific islander - white  -0.0283   <0.001 -0.0313 -0.02526
 black - white                    0.2151   <0.001  0.2083  0.22184
 hispanic - white                 0.1298   <0.001  0.1241  0.13549
 other - white                   -0.0110   <0.001 -0.0144 -0.00752
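
Note that each contrast here is simply the corresponding group's prediction minus the White prediction from p.race above. Using the printed estimates (the differences match the comparisons() output up to rounding):

```r
# Predicted probabilities printed from p.race above
p <- c(white = 0.0775, asian_pi = 0.0492, black = 0.293,
       hispanic = 0.207, other = 0.0665)

p[-1] - p[["white"]]  # contrasts vs. white: about -.028, .215, .130, -.011
```
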
. * 19-year-old male stopped at 1am
. margins, at(male=1 race=2 timeofday=1 age=19 age_missing=0) ///
>         at(male=1 race=5 timeofday=1 age=19 age_missing=0) noatlegend

Adjusted predictions                            Number of obs     =    563,089
Model VCE    : OIM

Expression   : Pr(search_vehicle), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .2925427   .0044561    65.65   0.000     .2838089    .3012764
          2  |   .0774723    .001675    46.25   0.000     .0741894    .0807553
------------------------------------------------------------------------------

We could interpret this difference in a variety of ways, including:

  • For a 19-year-old man stopped at 1am, the predicted probability of a vehicular search is 29.3% if he is Black but only 7.7% if he is White.
  • Our model estimates indicate that a Black 19-year-old man stopped at 1am is nearly four times as likely to have his vehicle searched as a White 19-year-old man stopped at 1am.
  • According to our model, a 19-year-old Black man stopped at 1am is 21.5 percentage points more likely to be subjected to a vehicular search than a 19-year-old White man.


Earlier, we had also considered the case of a 65-year-old woman stopped at 8am. It is instructive to also look at the Black-White difference in this case:

p.test <- predictions(model,
            variables = list(race = c("black", "white")),
            newdata = datagrid(
              sex = "female",
              tod = "6am-9am",
              age = 65,
              age_missing = 0
            )
          )

as_tibble(p.test %>% select(race, estimate))
# A tibble: 2 × 2
  race  estimate
  <fct>    <dbl>
1 white  0.00414
2 black  0.0201 
. margins, at(male=0 race=2 timeofday=3 age=65 age_missing=0) ///
>         at(male=0 race=5 timeofday=3 age=65 age_missing=0) noatlegend

Adjusted predictions                            Number of obs     =    563,089
Model VCE    : OIM

Expression   : Pr(search_vehicle), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0200692   .0007331    27.37   0.000     .0186323    .0215061
          2  |    .004142   .0001525    27.16   0.000     .0038431    .0044409
------------------------------------------------------------------------------

The predicted probabilities are .020 in the case of a Black woman and .004 in the case of a White woman.

Notice that in our first example the Black-White difference was 21.5 percentage points, but it’s only 1.6 percentage points here.

  • This is a feature of the logit model: the implied difference between two groups is smaller when the probabilities in question are close to 0 or 1 than when they are in the middle of the range.
  • In the linear probability model, the gap would remain the same regardless, even if that meant predicted probabilities below 0 or above 1 in some cases.

Also, in our earlier example the Black man was nearly four times as likely as the otherwise comparable White man to have his vehicle searched. In the current example, the Black woman is closer to five times as likely to have her vehicle searched as the comparable White woman. This relative difference in probabilities also varies under the logit model, although less than it does under the linear probability model.
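
We can verify the compression directly: the Black-White gap is the same 1.594 on the log-odds scale in both scenarios, but the inverse logit maps it to very different probability gaps. Using the (rounded) coefficients from the table above:

```r
b_black <- 1.594  # Black coefficient (log-odds) from the table above

# Scenario 1: White 19-year-old man stopped between 12am and 3am
lp1 <- -3.309 + .7066 + .7363 - .03215 * 19
plogis(lp1 + b_black) - plogis(lp1)  # about .215

# Scenario 2: White 65-year-old woman stopped between 6am and 9am
lp2 <- -3.309 - .08343 - .03215 * 65
plogis(lp2 + b_black) - plogis(lp2)  # about .016
```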

Computing difference in predicted probability directly

Instead of computing two predictions and taking the difference, we can get the predicted difference directly. To do so, let’s revisit another earlier example, this one on campaign contributions in 2024. We will fit a logit model with the outcome of contribution to a Republican (vs. Democratic) presidential candidate, and our regressors will be sex (binary) and CPVI (continuous).

data <- read_csv("../csv/professor_contributions_2024.csv")
model_donors <- glm((rep_donor == "yes") ~ man + cpvi25_red, 
                data = data, 
                family = "binomial")
tula(model_donors)
Family: binomial / Link: logit
AIC            =  20616.514                            Number of obs   =   46510
BIC            =  20642.756                            McFadden R-sq   = 0.05489
Log likelihood = -10305.257                            Nagelkerke R-sq = 0.06788
────────────────────────────────────────────────────────────────────────────────
(rep_donor =~) │      Coef  Std. Err.          z     P>|z|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
manyes         │     .9439     .04228      22.32    <.0001       .861      1.027
cpvi25_red     │     .0313    .001264      24.76    <.0001     .02882     .03378
(Intercept)    │     -3.16     .03535      -89.4    <.0001     -3.229     -3.091
────────────────────────────────────────────────────────────────────────────────

Let’s look at the difference in the probability of being a Republican donor for men vs. women in a blue district (one with a 15-point Democratic lean, or -15 on cpvi25_red). Instead of computing two predictions separately, we use the comparisons() function to compute the difference directly.

comparisons(model_donors, variables = "man",
            newdata = datagrid(
              cpvi25_red = -15
            ))

 cpvi25_red Estimate Std. Error    z Pr(>|z|)  2.5 % 97.5 %
        -15    0.038    0.00176 21.6   <0.001 0.0345 0.0414

Term: man
Type: response
Comparison: yes - no

The result here can be interpreted as follows: in a D+15 district, men professors are 3.8 percentage points more likely than women professors to have donated to a Republican (vs. Democratic) presidential candidate in 2024.

Now, let’s look at the same difference for a red district (R+15).

comparisons(model_donors, variables = "man",
            newdata = datagrid(
              cpvi25_red = 15
            ))

 cpvi25_red Estimate Std. Error  z Pr(>|z|)  2.5 % 97.5 %
         15   0.0849    0.00386 22   <0.001 0.0774 0.0925

Term: man
Type: response
Comparison: yes - no

The gender gap is now 8.5 percentage points. The gap is larger because the overall probabilities of donating to a Republican candidate are low among professors, but higher in red districts than in blue districts. The closer the predicted probabilities in question are to .5, the bigger the difference between two groups will be. The closer the probabilities are to 0 or 1, the smaller the differences will be.
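
The same by-hand check works here with the donor model's (rounded) coefficients: the man coefficient (.9439 log-odds) is identical in both districts, but the implied percentage-point gap is not.

```r
# Gender gap in Pr(Republican donor) at a given district score,
# using (rounded) coefficients from the table above
gap <- function(cpvi) {
  lp_woman <- -3.16 + .0313 * cpvi
  plogis(lp_woman + .9439) - plogis(lp_woman)
}

gap(-15)  # D+15 district: about .038
gap(15)   # R+15 district: about .085
```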

Summarizing predicted change over a sample

In the logit model, the change in probability as one explanatory variable changes depends on the value of the other explanatory variables in the model. For our example, the difference:

\[ \widehat{\Pr}(\mathtt{Rep}=1|\mathtt{cpvi}, \mathtt{man}=\textrm{yes}) - \widehat{\Pr}(\mathtt{Rep}=1|\mathtt{cpvi}, \mathtt{man}=\textrm{no}) \]

varies depending on the values of \(\mathtt{cpvi}\). Nevertheless, we may often want a simple, summary characterization of the relationship between our key explanatory variable and the predicted probability of the outcome.

Predicted change at the mean

If we were going to pick one value of \(\mathtt{cpvi}\) to provide one example of how the probability differs by gender, we might pick the mean.

In the marginaleffects package, when we exclude a quantitative variable from the datagrid() specification, it will be set to the mean.

comparisons(model_donors, variables = "man",
            newdata = datagrid(
            ))

 Estimate Std. Error    z Pr(>|z|)  2.5 % 97.5 %
   0.0476    0.00208 22.9   <0.001 0.0435 0.0516

Term: man
Type: response
Comparison: yes - no

We might interpret the result as:

For the average congressional district, men professors are 4.8 percentage points more likely to have donated to a Republican (vs. Democratic) candidate in 2024 than women professors.
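
As a check on this number: the datagrid() output in the aside below displays the sample mean of cpvi25_red as about -7, and plugging that into the (rounded) coefficients from the table above reproduces the estimate.

```r
# Gender gap at the sample mean of cpvi25_red (about -7),
# using (rounded) coefficients from the table above
lp_woman <- -3.16 + .0313 * -7
plogis(lp_woman + .9439) - plogis(lp_woman)  # about .0476
```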

If we had multiple explanatory variables, we might refer to a hypothetical person who is “average on all characteristics in the model” or something similar.

If the model included any categorical variables besides \(\mathtt{man}\), then no actual person could have the “average” values of those characteristics. The conventional approach is to set each categorical level to its sample proportion (e.g., .24 for a level comprising 24% of the sample). This is similar to (although not precisely the same as) taking a weighted average of the predictions at the different levels.

Some people dislike this conceptually and regard it as a substantial drawback of the predicted-change-at-the-mean interpretation whenever categorical variables are involved. I disagree, but I also recognize that the matter veers into aesthetics, where objecting is more like arguing against another person’s tastes.

In R, a bigger issue is that we cannot obtain a predicted change using proportions for categorical variables if we have fit our model with the categorical variables as factor variables. Instead, the model has to be fit with dummy variables – that is, we need a numeric 1/0 variable for each level of a categorical variable except the reference category.

I will show how to do this here, but it is cumbersome, so one can also skip ahead and treat this as a digression.

Advanced aside: Obtaining predicted change with categorical variables set to their proportion.

To show this, I will add ivyplus as a regressor to our earlier model. First we create a numeric dummy variable from our factor variable:

data$ivp <- as.numeric(data$ivyplus == "yes")

Then, we fit the model.

model_donors2 <- glm((rep_donor == "yes") ~ man + ivp + cpvi25_red, 
                data = data, 
                family = "binomial")
tula(model_donors2)
Family: binomial / Link: logit
AIC            =  20614.282                            Number of obs   =   46510
BIC            =  20649.272                            McFadden R-sq   = 0.05508
Log likelihood = -10303.141                            Nagelkerke R-sq = 0.06812
────────────────────────────────────────────────────────────────────────────────
(rep_donor =~) │      Coef  Std. Err.          z     P>|z|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
manyes         │     .9464      .0423      22.37    <.0001      .8635      1.029
ivp            │    -.1858     .09238     -2.011     .0443     -.3668   -.004719
cpvi25_red     │     .0307    .001295      23.72    <.0001     .02816     .03324
(Intercept)    │    -3.152     .03552     -88.75    <.0001     -3.222     -3.083
────────────────────────────────────────────────────────────────────────────────

Even then, marginaleffects will not set a 1/0 variable to its mean by default; it will still try to handle it like a factor variable. Instead, we need to set the value explicitly with mean() inside datagrid().

compare <- comparisons(model_donors2, 
                       variables = "man",
                       newdata = datagrid(
                         ivp = mean(data$ivp)
                       )
                       )

as_tibble(compare %>% select(cpvi25_red, ivp, estimate))
# A tibble: 1 × 3
  cpvi25_red    ivp estimate
       <int>  <dbl>    <dbl>
1         -7 0.0754   0.0477

Mean predicted change

Instead of the predicted change at the mean, one may summarize the predicted change for the sample with the mean predicted change.

One computes the discrete change for each observation based on that observation’s values for the explanatory variables. Then one takes the average of all those differences.
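
In other words: predict for every row with the variable set to “yes,” predict again with it set to “no,” and average the row-wise differences. A sketch of that logic, using the donor model's (rounded) coefficients and a small hypothetical vector of district scores standing in for the full data:

```r
# Hypothetical district scores standing in for the sample's cpvi25_red values
cpvi <- c(-25, -15, -7, 0, 10, 20)

lp_no  <- -3.16 + .0313 * cpvi  # linear predictor with man = "no"
lp_yes <- lp_no + .9439         # linear predictor with man = "yes"

mean(plogis(lp_yes) - plogis(lp_no))  # mean predicted change over these rows
```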

In R, this can be done with the marginaleffects package using the avg_comparisons() function.

avg_comparisons(model_donors, 
            variables = "man")

 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
   0.0514     0.0022 23.3   <0.001 0.047 0.0557

Term: man
Type: response
Comparison: yes - no

The result indicates that, over the distribution of congressional districts in our data, the predicted probability of being a Republican (vs. Democratic) donor for men is on average 5.1 percentage points higher than for women.

This has not yet been implemented in Stata here, but the mean predicted change is what \(\mathtt{margins}\) computes by default with its dydx() option, so it is straightforward to do.

As a second example of mean predicted change, we can reconsider the model of police stops that we considered above. We can look at the mean predicted change between Blacks and Whites.

avg_comparisons(model,
            variables = list("race" = c("black", "white")))

 Estimate Std. Error     z Pr(>|z|)   2.5 % 97.5 %
  -0.0923    0.00118 -78.1   <0.001 -0.0946  -0.09

Term: race
Type: response
Comparison: white - black
. margins, dydx(2.race)   

Average marginal effects                        Number of obs     =    563,089
Model VCE    : OIM

Expression   : Pr(search_vehicle), predict()
dy/dx w.r.t. : 2.race

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      black  |   .0923258   .0011818    78.12   0.000     .0900096    .0946421
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

which we might interpret as:

  • Averaging over the configurations of sex, age, and time of stop in our data, the predicted probability of a search is .092 more for Blacks than it is for Whites.

Granted, with the above interpretation, it might seem strange that we are averaging over the characteristics of everyone, including people who are neither White nor Black.

Instead, we might be interested in what the average change would be for Black people; that is, how much lower on average the probability of a vehicular search would be if a Black person who is stopped were instead White.

We can do this by passing subset(race == "black") to the newdata argument:

avg_comparisons(model, 
            variables = list("race" = c("black", "white")),
            newdata = subset(race == "black"))

 Estimate Std. Error     z Pr(>|z|)  2.5 %  97.5 %
   -0.101    0.00127 -79.4   <0.001 -0.103 -0.0982

Term: race
Type: response
Comparison: white - black

This result may be interpreted as:

  • Among Black persons who are stopped, the predicted probability of a search for an otherwise comparable White person (same sex, age, and time of stop) is on average 10.1 percentage points lower.


Average treatment effects for the treated and untreated: With race/ethnicity, we aren’t really talking about a variable that would be subject to an intervention – that is, we are not fitting this model with the idea that some policy intervention might change someone from White to Black or vice versa. For questions that do involve a policy intervention, especially one that might be called a “treatment,” one will see common references to the average treatment effect (the ATE: the average change for everyone), the average treatment effect for the treated (the ATT: the average change for those in the group that got the treatment), and the average treatment effect for the untreated (the ATU: the average change for those in the group that did not get the treatment).