Average discrete change for a categorical explanatory variable
For an outcome with multiple categories, we can talk about the association between a change in the explanatory variable and a change in the probability of each of those categories.
This was also the case with a binary outcome, but there it was more self-evident that, for example, a .034 change in the probability that \(y=1\) implied a \(-.034\) change in the probability that \(y=0\). For the probability of one category to increase, the probability of the other must decrease by the same amount.
With an ordered outcome, the sum of the change in probability across all our categories will still be 0. But because there are multiple categories it is no longer the case that the change in the probability of any one outcome category implies exactly the opposite change in one other category.
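One way to write down why the changes must sum to zero, using notation introduced only for illustration (\(K\) outcome categories, and \(\mathbf{x}_a\) and \(\mathbf{x}_b\) for the two sets of explanatory-variable values being compared):

\[
\sum_{k=1}^{K} \Pr(y = k \mid \mathbf{x}) = 1 \ \text{ for any } \mathbf{x},
\qquad \text{so} \qquad
\sum_{k=1}^{K} \left[ \Pr(y = k \mid \mathbf{x}_b) - \Pr(y = k \mid \mathbf{x}_a) \right] = 1 - 1 = 0 .
\]

With \(K = 2\) this forces the two changes to be equal and opposite. With \(K > 2\), it only constrains their total.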
In our Wisconsin data in which the outcome was self-reported health, for example, we can look at the difference between men and women who are average on the other three variables in the model. To make clear how this works, let’s first look at the predicted probabilities separately for men and women.
Stata: Tables of predicted probabilities. I am using the add-on command \(\mathtt{mtable}\) to compute the predicted probabilities here, which is part of the \(\mathtt{spost13\_ado}\) package that I have co-authored. \(\mathtt{mtable}\) basically just runs margins for you and arranges the output in a more readable form.
Looking at the above table, for example, we can see that for someone who is average on class rank, test score, and family background, the probability of a man saying he is in excellent health is .239, whereas for a woman her probability is .220.
The difference in the probability of saying excellent for these two cases then is .019, or about two percentage points. We may compute this difference directly in Stata using \(\texttt{margins}\):
In Stata, we are using the \(\mathtt{dydx()}\) option to the \(\mathtt{margins}\) command, which we have also used when we want marginal change. When we use \(\mathtt{dydx()}\) with a factor variable, however, \(\texttt{margins}\) provides the discrete change – that is, the change from an observation being part of the base category (in this example, female) to the estimated category (male).
We can interpret the highlighted result as:
For 2004 WLS respondents with average test scores, high school grades, and family socioeconomic status, the probability of reporting being in excellent health is 1.9 percentage points higher for men than for women.
Stata: Important note about using \(\mathtt{dydx()}\) with categorical variables. \(\mathtt{dydx()}\) with a categorical variable only works correctly if we have specified the variable as a factor variable using the \(\mathtt{i.}varname\) syntax in the command that fits the model. If you just specify \(varname\), you’ll get the same coefficient in the model, but \(\mathtt{margins}\) will not handle it correctly.
In this example, \(\texttt{margins}\) provides the difference between men and women for an observation with the values of the explanatory variable specified using \(\texttt{at()}\).
In the output, we have estimated discrete change for each of the five categories. Because Excellent is the fifth category, the output labeled 5 is the change in the probability of answering “Excellent.”
The change in these probabilities will vary depending on the values of the other explanatory variables. For example, if we set the other explanatory variables to 1 instead of 0 (that is, one standard deviation above the mean instead of at the mean), we get a change of .024 instead of .019:
For 2004 WLS respondents whose test scores, high school grades, and family socioeconomic status are all a standard deviation above the mean, the probability of reporting being in excellent health is 2.4 percentage points higher for men than for women.
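To see why the discrete change depends on where the other variables are held, here is a minimal sketch in base R that computes ordered logit probabilities by hand. The cutpoints and coefficients are hypothetical values chosen for illustration, not the estimates from the WLS model, and the helper names `category_probs` and `discrete_change` are made up for this example.

```r
# Hypothetical ordered logit parameters (NOT the estimates from the WLS model).
cutpoints <- c(-3.0, -1.5, 0.2, 1.8)  # four cutpoints -> five outcome categories
b_male    <- 0.12                     # assumed coefficient on male
b_other   <- c(0.25, 0.20, 0.15)      # assumed coefficients on three standardized covariates

# Probability of each outcome category for a given linear predictor xb.
category_probs <- function(xb) {
  cumulative <- c(plogis(cutpoints - xb), 1)  # Pr(y <= k) for k = 1, ..., 5
  diff(c(0, cumulative))                      # Pr(y = k)
}

# Discrete change for male (1 vs. 0), holding the other covariates at z.
discrete_change <- function(z) {
  xb_other <- sum(b_other * z)
  category_probs(b_male + xb_other) - category_probs(xb_other)
}

dc_at_mean  <- discrete_change(c(0, 0, 0))  # other covariates at their means
dc_at_plus1 <- discrete_change(c(1, 1, 1))  # other covariates one SD above their means

round(rbind(dc_at_mean, dc_at_plus1), 4)
```

Each row sums to zero across the five categories, but the two rows differ: the same coefficient on `male` translates into a different change in probability depending on where the case sits on the other variables.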
As elsewhere, then, we may be interested in providing a simple overall summary of the changes over the discrete variables. Earlier, we provided the change at the mean of the other explanatory variables, which I think works well in this application because the other variables are all continuous and it is easy to imagine two cases that are the same on these variables but only differ in their sex.
However, we could also do the average discrete change. For every case, we would compute the difference in the predicted probability for \(\texttt{male} = 1\) and \(\texttt{male} = 0\), given the values of the other explanatory variables for that case. Then we would average all these differences.
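Written as a formula, with \(N\) cases, \(\mathbf{x}_i\) for case \(i\)'s observed values on the other explanatory variables, and \(k\) indexing the outcome category (the symbol \(\text{ADC}_k\) is just a label used for this illustration):

\[
\text{ADC}_k = \frac{1}{N} \sum_{i=1}^{N} \left[ \Pr(y_i = k \mid \texttt{male} = 1, \mathbf{x}_i) - \Pr(y_i = k \mid \texttt{male} = 0, \mathbf{x}_i) \right]
\]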
The average discrete change here is .019, which is similar (but not identical) to the discrete change at the means that we computed before. We could interpret this as:
Net of class rank, test score, and family background, men are on average about two percentage points more likely to report being in excellent health.
“On average” is maybe a little bit of a cheat here for “averaging over the observed values of the explanatory variables,” but seems harmless enough.
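The averaging described above can also be done by hand, which may help make clear what the software is computing. The sketch below uses simulated data as a stand-in (it is not the WLS file, and the variable names `male`, `score`, and `health` are hypothetical), fits an ordered logit with `polr()` from MASS, predicts every case twice, and averages the differences.

```r
library(MASS)  # provides polr()

set.seed(42)
n <- 2000

# Simulated stand-in data (hypothetical; not the WLS data).
sim <- data.frame(
  male  = rbinom(n, 1, 0.5),
  score = rnorm(n)
)
latent <- 0.5 * sim$male + 0.5 * sim$score + rlogis(n)
sim$health <- cut(latent,
                  breaks = c(-Inf, -2, -0.5, 1, 2.5, Inf),
                  labels = c("poor", "fair", "good", "very good", "excellent"),
                  ordered_result = TRUE)

fit <- polr(health ~ male + score, data = sim, Hess = TRUE)

# Two copies of the data: everyone set to male = 1, then everyone set to male = 0.
# Each case keeps its own observed values on the other covariates.
as_male   <- sim; as_male$male   <- 1
as_female <- sim; as_female$male <- 0

p_male   <- predict(fit, newdata = as_male,   type = "probs")
p_female <- predict(fit, newdata = as_female, type = "probs")

# Average the case-level differences, one value per outcome category.
adc <- colMeans(p_male - p_female)
round(adc, 4)
```

The five averages sum to zero, as they must, and the entry for "excellent" is the counterpart of the quantity interpreted above.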
Example using a multicategory explanatory variable
In this example, we are using the General Social Survey data in which the outcome is attitudes on national health spending (1 = “too little”, 2 = “about right”, 3 = “too much”). We have several explanatory variables, one of which is \(\texttt{race4cat}\), where the categories are White (\(=1\)), Black (\(=2\)), Latino (\(=3\)), and other (\(=4\)).
library(tidyverse)
library(haven)
library(MASS)
library(marginaleffects)

# Read and prepare data
df <- read_dta("../dta/gss_natheal.dta") %>%
  filter(as.integer(natheal) <= 3) %>%  # keep only too little/about right/too much
  mutate(
    natheal  = factor(as_factor(natheal),
                      levels = c("too little", "about right", "too much")),
    year     = as_factor(year),
    party    = relevel(as_factor(party), ref = "indep"),
    female   = as_factor(female),
    race4cat = relevel(as_factor(race4cat), ref = "black")
  ) %>%
  drop_na(race4cat, party)

# Fit ordered logit model with weights
model <- polr(natheal ~ year + party + female + race4cat + age,
              data = df, Hess = TRUE, method = "logistic",
              weights = df$wtssall)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
# Average discrete change for race4cat
avg_marg_effect <- avg_slopes(model, variables = "race4cat")
avg_marg_effect
Group        Contrast        Estimate  Std. Error      z  Pr(>|z|)     S    2.5 %   97.5 %
too little   latino - black   -0.1431     0.01593  -8.98    <0.001  61.7  -0.1743  -0.1119
too little   other - black    -0.1508     0.02287  -6.59    <0.001  34.4  -0.1956  -0.1059
too little   white - black    -0.0644     0.01258  -5.12    <0.001  21.6  -0.0890  -0.0397
about right  latino - black    0.0943     0.01062   8.88    <0.001  60.3   0.0734   0.1151
about right  other - black     0.0988     0.01435   6.88    <0.001  37.3   0.0707   0.1269
about right  white - black     0.0444     0.00893   4.97    <0.001  20.5   0.0269   0.0619
too much     latino - black    0.0489     0.00566   8.63    <0.001  57.2   0.0378   0.0599
too much     other - black     0.0520     0.00883   5.88    <0.001  27.9   0.0346   0.0693
too much     white - black     0.0199     0.00371   5.38    <0.001  23.7   0.0127   0.0272
Term: race4cat
Type: probs
The reference category in the example is Black (non-Latino), so the other results are changes in probabilities between other race/ethnic categories and Black.
Here is how I might interpret the highlighted result:
Net of other variables, White GSS respondents are less likely than Black respondents to say the US is spending too little on health. Averaging over all respondents, the difference is about 6.4 percentage points.
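For a multicategory explanatory variable, each contrast against the reference category is its own average discrete change. The sketch below illustrates the logic by hand on simulated data. The data, the variable names `race`, `age`, and `natheal`, and the helper `avg_probs_at` are all hypothetical stand-ins, not the GSS file or any package's internals.

```r
library(MASS)  # provides polr()

set.seed(7)
n <- 3000

# Simulated stand-in data (hypothetical; not the GSS data).
race_levels <- c("black", "white", "latino", "other")  # first level is the reference
sim <- data.frame(
  race = factor(sample(race_levels, n, replace = TRUE), levels = race_levels),
  age  = rnorm(n)
)
shift  <- c(black = 0, white = 0.3, latino = 0.6, other = 0.6)
latent <- unname(shift[as.character(sim$race)]) + 0.2 * sim$age + rlogis(n)
sim$natheal <- cut(latent,
                   breaks = c(-Inf, 0.5, 2, Inf),
                   labels = c("too little", "about right", "too much"),
                   ordered_result = TRUE)

fit <- polr(natheal ~ race + age, data = sim, Hess = TRUE)

# Average predicted probabilities when every case is assigned to one race category,
# keeping each case's observed value of age.
avg_probs_at <- function(level) {
  counterfactual <- sim
  counterfactual$race <- factor(rep(level, nrow(sim)), levels = race_levels)
  colMeans(predict(fit, newdata = counterfactual, type = "probs"))
}

# One row per contrast with the reference category, one column per outcome category.
reference <- avg_probs_at("black")
adc <- t(sapply(c("white", "latino", "other"),
                function(level) avg_probs_at(level) - reference))
round(adc, 4)
```

Each row is one contrast with the reference category and sums to zero across the three outcome categories. Changing the reference category changes which contrasts are reported, not the underlying predicted probabilities.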