The most intuitive way for audiences to think about patterns in a binary outcome is in terms of probabilities. The drawbacks we noted with the linear probability model weren’t about the probability part; they were about the premise that changes in those probabilities would be linear.
Introducing our example
We are looking at data on vehicular stops by the San Francisco Police Department from 2007-2016, from the Stanford Open Policing Project. Our extract includes only those stops for which the reason for the stop was coded as a “moving violation.”
The explanatory variables in our example will be sex, race, age, and the time of day of the stop. Age is missing in many cases, and in these cases we code \(\texttt{age}\) as its mean and \(\texttt{age\_missing}\) as 1.
Why did we recode missing values for age to the mean with a missing variable flag? I provide an explanation here.
Expand for code that loads packages and data, and recodes some variables
library(tidyverse)library(haven)library(tulaverse)# import stata data with havendf <-read_dta("../dta/sfpd_mv.dta")# R: need to convert variables that imported as labeled numeric to factor variablesdf <- df %>%mutate(sex =as_factor(male)) %>%mutate(race =as_factor(race)) %>%mutate(tod =as_factor(timeofday)) %>%mutate(race =relevel(race, ref="white")) %>%mutate(tod =relevel(tod, ref ="9am-12pm"))# mean impute missing values for agedf$age[is.na(df$age)] <-mean(df$age, na.rm =TRUE)
model <-glm(search_vehicle ~ sex + race + tod + age + age_missing, data = df, family ="binomial")tula(model)
Just from looking at the signs of the coefficients, our results indicate that men, Blacks and Hispanics, younger people, and people stopped at night are more likely to have their cars searched.
Predicted probability at specific values of x
With the logit model, the predicted probability of a positive outcome is:
Given these estimates, we can compute the predicted probability of a search for any combination of explanatory variables.
For example, let’s imagine the person stopped is a 35-year-old Hispanic woman pulled over at 7pm. What is the predicted probability of her car being searched?
From our estimates, \(\mathbf{x}_i\hat{\boldsymbol{\beta}}\) would be
In R, rather than compute this ourselves, we will use the predictions() function from the marginaleffects library. While predictions() returns several columns for each prediction, the predicted probability itself is stored in estimate.
library(marginaleffects)
p <-predictions(model,newdata =datagrid(sex ="female",race ="hispanic",tod ="6pm-9pm",age =35,age_missing =0 ) )p$estimate
[1] 0.05715326
In Stata, rather than compute this ourselves, we could have used the \(\mathtt{margins}\) command.
. margins, at(male=0 race=3 timeofday=7 age=35 age_missing=0)
Adjusted predictions Number of obs = 563,089
Model VCE : OIM
Expression : Pr(search_vehicle), predict()
at : male = 0
race = 3
timeofday = 7
age = 35
age_missing = 0
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | .0571533 .001358 42.09 0.000 .0544916 .0598149
------------------------------------------------------------------------------
Comparing predictions for meaningful combinations of values
To illustrate the range of predicted probabilities, we can compute predictions for two contrasting but easily imagined cases.
For our results, let’s think about two contrasting cases: a 19-year-old Black man stopped at 1am, and a 65-year-old White woman stopped at 8am. At least if you are like me, even though these are totally hypothetical people, you can visualize these two folks right now, in their cars, being pulled over.
Stata tip: In the output above, predicted probabilities are numbered by their \(\mathtt{\_at}\) value, with a legend above showing the explanatory variable values for each. This is helpful, but once you get into more complicated uses of \(\mathtt{margins}\) can readily result in hundreds of extra lines of output. The option \(\mathtt{noatlegend}\) tells Stata not to include this legend.
What do these results show? If a 19-year-old Black man is stopped at 1am, his predicted probability of having his vehicle searched is 29.2%. If a 65-year-old White woman is stopped at 8am, in contrast, her predicted probability of a vehicle search is less than one half of one percent. Putting it another way, the 19-year-old Black man stopped at 1am is more than 70 times more likely to be subjected to a vehicle search than a 65-year-old White woman stopped at 8am.
Extra: Why did we recode missing values for age to the mean with a missing variable flag?
Doing so allows us to keep age in the model without losing cases where age is missing. As long as we include \(\texttt{age\_missing}\) in the model, coding age to its mean will not affect any of our slope estimates. Coding it specifically to the sample mean has the additional virtue that the mean of our recoded variable will still equal the sample mean of age.
I am not necessarily commending this as a general solution. Indeed, it does mean that the coefficient for \(\texttt{age}\) will be biased, because the coefficient for \(\texttt{age}\) would be expected to be larger in magnitude if age had been accurately measured for everyone. The best way to address missing data is ultimately a big topic in its own right and outside the scope of this example.