Computing and Comparing Predicted Probabilities

The most intuitive way for audiences to think about patterns in a binary outcome is in terms of probabilities. The drawbacks we noted with the linear probability model weren’t about the probability part; they were about the premise that changes in those probabilities would be linear.

Introducing our example

We are looking at data on vehicular stops by the San Francisco Police Department from 2007-2016, from the Stanford Open Policing Project. Our extract includes only those stops for which the reason for the stop was coded as a “moving violation.”

The explanatory variables in our example will be sex, race, age, and the time of day of the stop. Age is missing in many cases, and in these cases we code \(\texttt{age}\) as its mean and \(\texttt{age\_missing}\) as 1.

Why did we recode missing values for age to the mean with a missing variable flag? I provide an explanation here.

Expand for code that loads packages and data, and recodes some variables
library(tidyverse)
library(haven)
library(tulaverse)

# import stata data with haven
df <- read_dta("../dta/sfpd_mv.dta")

# R: need to convert variables that imported as labeled numeric to factor variables
df <- df %>%
  mutate(sex = as_factor(male)) %>%
  mutate(race = as_factor(race)) %>%
  mutate(tod = as_factor(timeofday)) %>%
  mutate(race = relevel(race, ref="white")) %>%
  mutate(tod = relevel(tod, ref = "9am-12pm"))

# mean impute missing values for age

df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
model <- glm(search_vehicle ~ sex + race + tod + age + age_missing, 
             data = df, family = "binomial")
tula(model)
Family: binomial / Link: logit
AIC            = 185499.309                             Number of obs   = 563089
BIC            = 185667.927                             McFadden R-sq   = 0.1169
Log likelihood = -92734.655                             Nagelkerke R-sq = 0.1371
────────────────────────────────────────────────────────────────────────────────
search_vehicle │      Coef  Std. Err.          z     P>|z|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
sex            │                                                                
  male         │     .7066     .01719       41.1    <.0001      .6729      .7403
race           │                                                                
  asian/paci~r │    -.4847     .02758     -17.57    <.0001     -.5387     -.4306
  black        │     1.594     .01694      94.12    <.0001      1.561      1.627
  hispanic     │     1.136     .01881      60.37    <.0001      1.099      1.173
  other        │    -.1643     .02696     -6.092    <.0001     -.2171     -.1114
tod            │                                                                
  12am-3am     │     .7363     .02814      26.16    <.0001      .6812      .7915
  3am-6am      │     .7169     .04061      17.65    <.0001      .6373      .7965
  6am-9am      │   -.08343     .03515     -2.374     .0176     -.1523    -.01454
  12pm-3pm     │     .3093     .02719      11.38    <.0001       .256      .3626
  3pm-6pm      │     .4024     .02616      15.38    <.0001      .3511      .4537
  6pm-9pm      │     .4957     .02653      18.68    <.0001      .4437      .5477
  9pm-12am     │     .5703     .02585      22.07    <.0001      .5197       .621
age            │   -.03215   .0005883     -54.64    <.0001     -.0333      -.031
age_missing    │    -.1768     .02967     -5.961    <.0001      -.235     -.1187
(Intercept)    │    -3.309     .03419      -96.8    <.0001     -3.376     -3.242
────────────────────────────────────────────────────────────────────────────────
. logit search_vehicle i.male ib5.race ib4.timeofday age i.age_missing, allbaselevels Iteration 0: log likelihood = -105015.82 Iteration 1: log likelihood = -97990.252 Iteration 2: log likelihood = -92803.999 Iteration 3: log likelihood = -92734.741 Iteration 4: log likelihood = -92734.655 Iteration 5: log likelihood = -92734.655 Logistic regression Number of obs = 563,089 LR chi2(14) = 24562.33 Prob > chi2 = 0.0000 Log likelihood = -92734.655 Pseudo R2 = 0.1169 ----------------------------------------------------------------------------------------- search_vehicle | Coef. Std. Err. z P>|z| [95% Conf. Interval] ------------------------+---------------------------------------------------------------- male | female | 0 (base) male | .7066348 .017192 41.10 0.000 .6729392 .7403305 | race | asian/pacific islander | -.4846714 .0275788 -17.57 0.000 -.5387248 -.4306179 black | 1.594129 .0169368 94.12 0.000 1.560934 1.627325 hispanic | 1.135663 .0188131 60.37 0.000 1.09879 1.172537 other | -.164262 .0269622 -6.09 0.000 -.2171069 -.1114171 white | 0 (base) | timeofday | 12am-3am | .736314 .0281415 26.16 0.000 .6811576 .7914704 3am-6am | .7169206 .0406075 17.65 0.000 .6373314 .7965098 6am-9am | -.0834315 .0351496 -2.37 0.018 -.1523235 -.0145395 9am-12pm | 0 (base) 12pm-3pm | .3093293 .0271883 11.38 0.000 .2560412 .3626175 3pm-6pm | .4023873 .0261574 15.38 0.000 .3511197 .4536549 6pm-9pm | .4956965 .0265322 18.68 0.000 .4436945 .5476986 9pm-12am | .5703152 .0258459 22.07 0.000 .5196582 .6209723 | age | -.0321489 .0005883 -54.64 0.000 -.033302 -.0309958 | age_missing | 0 | 0 (base) 1 | -1.423639 .0352915 -40.34 0.000 -1.492809 -1.354469 | _cons | -3.309316 .0341855 -96.80 0.000 -3.376319 -3.242314 -----------------------------------------------------------------------------------------

Just from looking at the signs of the coefficients, our results indicate that men, Blacks and Hispanics, younger people, and people stopped at night are more likely to have their cars searched.

Predicted probability at specific values of x

With the logit model, the predicted probability of a positive outcome is:

\[ \widehat{\Pr}(y=1) = \frac{\exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})}{\exp(\mathbf{x}_i\hat{\boldsymbol{\beta}}) + 1} \]

Given these estimates, we can compute the predicted probability of a search for any combination of explanatory variables.

For example, let’s imagine the person stopped is a 35-year-old Hispanic woman pulled over at 7pm. What is the predicted probability of her car being searched?

From our estimates, \(\mathbf{x}_i\hat{\boldsymbol{\beta}}\) would be

\[ \begin{align} \mathbf{x}_i\hat{\boldsymbol{\beta}} & = 0 + 1.135 + .496 - (.0321 \times 35) - 3.309 \\ & = -2.802 \end{align} \]

The log odds of her car being searched is −2.802, which we can convert into a predicted probability.

\[ \begin{align} \widehat{\Pr}(y=1) & = \frac{\exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})}{\exp(\mathbf{x}_i\hat{\boldsymbol{\beta}}) + 1} \\ \\ & = \frac{\exp(-2.802)}{\exp(-2.802)+1} \\ \\ & = \frac{.0607}{1.0607} \\ \\ & = \mathbf{.0572} \end{align} \]

Our hypothetical woman thus has a .0572 predicted probability of being searched.

In R, rather than compute this ourselves, we will use the predictions() function from the marginaleffects library. While predictions() returns several columns for each prediction, the predicted probability itself is stored in estimate.

library(marginaleffects)
p <- predictions(model,
            newdata = datagrid(
              sex = "female",
              race = "hispanic",
              tod = "6pm-9pm",
              age = 35,
              age_missing = 0
            )
            )

p$estimate
[1] 0.05715326

In Stata, rather than compute this ourselves, we could have used the \(\mathtt{margins}\) command.

. margins, at(male=0 race=3 timeofday=7 age=35 age_missing=0) Adjusted predictions Number of obs = 563,089 Model VCE : OIM Expression : Pr(search_vehicle), predict() at : male = 0 race = 3 timeofday = 7 age = 35 age_missing = 0 ------------------------------------------------------------------------------ | Delta-method | Margin Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | .0571533 .001358 42.09 0.000 .0544916 .0598149 ------------------------------------------------------------------------------

Comparing predictions for meaningful combinations of values

To illustrate the range of predicted probabilities, we can compute predictions for two contrasting but easily imagined cases.

For our results, let’s think about two contrasting cases: a 19-year-old Black man stopped at 1am, and a 65-year-old White woman stopped at 8am. At least if you are like me, even though these are totally hypothetical people, you can visualize these two folks right now, in their cars, being pulled over.

p <- predictions(model,
            newdata = datagrid(
              sex = "male",
              race = "black",
              tod = "12am-3am",
              age = 19,
              age_missing = 0
            )
            )

p$estimate
[1] 0.2925427
p <- predictions(model,
            newdata = datagrid(
              sex = "female",
              race = "white",
              tod = "6am-9am",
              age = 65,
              age_missing = 0
            )
            )

p$estimate
[1] 0.004142001

In Stata, we will use \(\mathtt{margins}\) to compute predicted probabilities.

. margins, at(male=1 race=2 timeofday=1 age=19 age_missing=0) > at(male=0 race=5 timeofday=3 age=65 age_missing=0) Adjusted predictions Number of obs = 563,089 Model VCE : OIM Expression : Pr(search_vehicle), predict() 1._at : male = 1 race = 2 timeofday = 1 age = 19 age_missing = 0 2._at : male = 0 race = 5 timeofday = 3 age = 65 age_missing = 0 ------------------------------------------------------------------------------ | Delta-method | Margin Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _at | 1 | .2925427 .0044561 65.65 0.000 .2838089 .3012764 2 | .004142 .0001525 27.16 0.000 .0038431 .0044409 ------------------------------------------------------------------------------

Stata tip: In the output above, predicted probabilities are numbered by their \(\mathtt{\_at}\) value, with a legend above showing the explanatory variable values for each. This is helpful, but once you get into more complicated uses of \(\mathtt{margins}\) can readily result in hundreds of extra lines of output. The option \(\mathtt{noatlegend}\) tells Stata not to include this legend.

What do these results show? If a 19-year-old Black man is stopped at 1am, his predicted probability of having his vehicle searched is 29.2%. If a 65-year-old White woman is stopped at 8am, in contrast, her predicted probability of a vehicle search is less than one half of one percent. Putting it another way, the 19-year-old Black man stopped at 1am is more than 70 times more likely to be subjected to a vehicle search than a 65-year-old White woman stopped at 8am.

Extra: Why did we recode missing values for age to the mean with a missing variable flag?

Doing so allows us to keep age in the model without losing cases where age is missing. As long as we include \(\texttt{age\_missing}\) in the model, coding age to its mean will not affect any of our slope estimates. Coding it specifically to the sample mean has the additional virtue that the mean of our recoded variable will still equal the sample mean of age.

I am not necessarily commending this as a general solution. Indeed, it does mean that the coefficient for \(\texttt{age}\) will be biased, because the coefficient for \(\texttt{age}\) would be expected to be larger in magnitude if age had been accurately measured for everyone. The best way to address missing data is ultimately a big topic in its own right and outside the scope of this example.