Average discrete change for a categorical explanatory variable

For an outcome with multiple categories, we can talk about the association between a change in the explanatory variable and a change in the probability of each of those categories.

This was also the case with a binary outcome, but there it was more self-evident that, for exmaple, a .034 change in the probability that \(y=1\) implied a \(-.034\) change in the probability that \(y=0\). For the probability of one category to increase, the probability of the other must decrease by the same amount.

With an ordered outcome, the sum of the change in probability across all our categories will still be 0. But because there are multiple categories it is no longer the case that the change in the probability of any one outcome category implies exactly the opposite change in one other category.

In our Wisconsin data in which the outcome was self-reported health, for example, we can look at the difference between men and women who are average on the other three variables in the model. To make clear how this works, let’s first look at the predicted probabilities separately for men and women.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)
library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

library(marginaleffects)

# Read data and prepare for analysis
df <- read_dta("../dta/wlshealth.dta") %>%
  filter(!is.na(health04)) %>%
  mutate(health04 = as_factor(health04),
         male = 1 - female,
         male = as_factor(male))

# Fit ordered logit model
model <- polr(health04 ~ male + z_ses57 + z_classrank + hn1,
              data = df, Hess = TRUE, method = "logistic")

# Get predicted probabilities for women (male = 0) and men (male = 1)
# at mean values of other covariates
pred_women <- avg_predictions(model,
                              newdata = datagrid(male = "0",
                                               z_classrank = 0,
                                               hn1 = 0,
                                               z_ses57 = 0))
pred_men <- avg_predictions(model,
                            newdata = datagrid(male = "1",
                                             z_classrank = 0,
                                             hn1 = 0,
                                             z_ses57 = 0))

# Display results in table format
pred_table <- bind_rows(
  as.data.frame(pred_women) %>% mutate(sex = "Women"),
  as.data.frame(pred_men) %>% mutate(sex = "Men")
) %>%
  dplyr::select(sex, group, estimate) %>%
  pivot_wider(names_from = group, values_from = estimate)

pred_table

# A tibble: 2 × 6
  sex     poor   fair  good `very good` excellent
  <chr>  <dbl>  <dbl> <dbl>       <dbl>     <dbl>
1 Women 0.0230 0.0797 0.291       0.386     0.220
2 Men   0.0206 0.0723 0.275       0.393     0.239


. use ../dta/wlshealth.dta, clear

. gen male = 1 - female
(4 missing values generated)

. quietly ologit health04 i.male z_ses57 z_classrank hn1

. quietly mtable, at(male = 0 z_classrank = 0 hn1 = 0 z_ses57 = 0) clear rowname(Women) noatvars

. mtable, at(male = 1 z_classrank = 0 hn1 = 0 z_ses57 = 0) below rowname(Men) noatvars

Expression: Pr(health04), predict(outcome())

           |     poor      fair      good  very_good  excellent
 ----------+---------------------------------------------------
     Women |    0.023     0.080     0.291      0.386      0.220
       Men |    0.021     0.072     0.275      0.393      0.239

Stata: Tables of predicted probabilities. I am using the add-on command \(\mathtt{mtable}\) to compute the predicted probabilities here, which is part of the \(\mathtt{spost13\_ado}\) package that I have co-authored. \(\mathtt{mtable}\) basically just runs margins for you and arranges the output in a more readable form.

Looking at the above table, for example, we can see that for someone who is average on class rank, test score, and family background, the probability of a man saying he is in excellent health is .239, whereas for a woman her probability is .220.

The difference in the probability of saying excellent for these two cases then is .019, or about two percentage points. We may compute this difference directly in Stata using \(\texttt{margins}\):

library(tidyverse)
library(haven)
library(MASS)
library(marginaleffects)

# Read and prepare data
df <- read_dta("../dta/wlshealth.dta") %>%
  filter(!is.na(health04)) %>%
  mutate(health04 = as_factor(health04),
         male = 1 - female,
         male = as_factor(male))

# Fit ordered logit model
model <- polr(health04 ~ male + z_ses57 + z_classrank + hn1,
              data = df, Hess = TRUE, method = "logistic")

# Compute discrete change (marginal effect) of male at specific values
marg_effect <- avg_slopes(model,
                          newdata = datagrid(male = unique,
                                            z_classrank = 0,
                                            hn1 = 0,
                                            z_ses57 = 0),
                          variables = "male")

marg_effect


     Group Estimate Std. Error     z Pr(>|z|)   S    2.5 %    97.5 %
 poor      -0.00233   0.000968 -2.41   0.0161 6.0 -0.00422 -0.000431
 fair      -0.00731   0.003002 -2.43   0.0149 6.1 -0.01319 -0.001424
 good      -0.01614   0.006628 -2.43   0.0149 6.1 -0.02913 -0.003147
 very good  0.00648   0.002660  2.43   0.0149 6.1  0.00126  0.011689
 excellent  0.01930   0.007936  2.43   0.0150 6.1  0.00374  0.034852

Term: male
Type: probs
Comparison: 1 - 0


. margins, dydx(male) at(z_classrank = 0 hn1 = 0 z_ses57 = 0) noatlegend

Conditional marginal effects                             Number of obs = 7,221
Model VCE: OIM

dy/dx wrt: 1.male

1._predict: Pr(health04==1), predict(pr outcome(1))
2._predict: Pr(health04==2), predict(pr outcome(2))
3._predict: Pr(health04==3), predict(pr outcome(3))
4._predict: Pr(health04==4), predict(pr outcome(4))
5._predict: Pr(health04==5), predict(pr outcome(5))

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
0.male       |  (base outcome)
-------------+----------------------------------------------------------------
1.male       |
    _predict |
          1  |  -.0023281   .0009677    -2.41   0.016    -.0042247   -.0004314
          2  |   -.007308    .003002    -2.43   0.015    -.0131919   -.0014241
          3  |  -.0161365   .0066276    -2.43   0.015    -.0291264   -.0031466
          4  |   .0064753     .00266     2.43   0.015     .0012617    .0116889
          5  |   .0192973    .007936     2.43   0.015     .0037431    .0348515
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

In Stata, we are using the \(\mathtt{dydx()}\) option to the \(\mathtt{margins}\) command, which we have also used when we want marginal change. When we use \(\mathtt{dydx()}\) with a factor variable, however, \(\texttt{margins}\) provides the discrete change – that is, the change from an observation being part of the base category (in this example, female) to the estimated category (male).

We can interpreted the highlighted result as:

For 2004 WLS respondents with average test scores, high school grades, and family socioeconomic status, the probability of reporting being in excellent health is 1.9 percentage points higher for men than for women.

Stata: Important note about using \(\mathtt{dydx()}\) with categorical variables. \(\mathtt{dydx()}\) with a categorical variable only works correctly if we have specified the variable as a factor variable using the \(\mathtt{i.}varname\) syntax in the command that fits the model. If you just specify \(varname\), you’ll get the same coefficient in the model, but \(\mathtt{margins}\) will not handle it correctly.

In this example, \(\texttt{margins}\) provides the difference between men and women for an observation with the values of the explanatory variable specified using \(\texttt{at()}\).

In the output, we have estimated discrete change for each of the five categories. Because Excellent is the fifth category, the output labeled 5 is the change in the probability of answering “Excellent.”

This change in these probabilities will vary depending the values of the other explanatory variables. For example, if we set the other explanatory variables so their value is 1 instead of 0, we get a change of .024 instead of .019:

library(tidyverse)
library(haven)
library(MASS)
library(marginaleffects)

# Read and prepare data
df <- read_dta("../dta/wlshealth.dta") %>%
  filter(!is.na(health04)) %>%
  mutate(health04 = as_factor(health04),
         male = 1 - female,
         male = as_factor(male))

# Fit ordered logit model
model <- polr(health04 ~ male + z_ses57 + z_classrank + hn1,
              data = df, Hess = TRUE, method = "logistic")

# Discrete change at values 1 SD above mean
marg_effect <- avg_slopes(model,
                          newdata = datagrid(male = unique,
                                            z_classrank = 1,
                                            hn1 = 1,
                                            z_ses57 = 1),
                          variables = "male")

marg_effect


     Group Estimate Std. Error     z Pr(>|z|)   S    2.5 %    97.5 %
 poor      -0.00144   0.000594 -2.42   0.0157 6.0 -0.00260 -2.71e-04
 fair      -0.00488   0.001996 -2.45   0.0145 6.1 -0.00879 -9.69e-04
 good      -0.01531   0.006266 -2.44   0.0145 6.1 -0.02759 -3.03e-03
 very good -0.00249   0.001279 -1.95   0.0513 4.3 -0.00500  1.46e-05
 excellent  0.02412   0.009938  2.43   0.0152 6.0  0.00464  4.36e-02

Term: male
Type: probs
Comparison: 1 - 0


. margins, dydx(male) at(z_classrank = 1 hn1 = 1 z_ses57 = 1) noatlegend

Conditional marginal effects                             Number of obs = 7,221
Model VCE: OIM

dy/dx wrt: 1.male

1._predict: Pr(health04==1), predict(pr outcome(1))
2._predict: Pr(health04==2), predict(pr outcome(2))
3._predict: Pr(health04==3), predict(pr outcome(3))
4._predict: Pr(health04==4), predict(pr outcome(4))
5._predict: Pr(health04==5), predict(pr outcome(5))

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
0.male       |  (base outcome)
-------------+----------------------------------------------------------------
1.male       |
    _predict |
          1  |  -.0014356   .0005941    -2.42   0.016    -.0025999   -.0002712
          2  |  -.0048821   .0019963    -2.45   0.014    -.0087948   -.0009694
          3  |  -.0153108   .0062663    -2.44   0.015    -.0275925   -.0030292
          4  |  -.0024917   .0012787    -1.95   0.051    -.0049979    .0000146
          5  |   .0241201   .0099379     2.43   0.015     .0046423     .043598
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

This result would be interpreted as:

For 2004 WLS respondents whose test scores, high school grades, and family socioeconomic status are all a standard deviation above the mean, the probability of reporting being in excellent health is 2.4 percentage points higher for men than for women.

As elsewhere, then, we may be interested in providing a simple overall summary of the changes over the discrete variables. Earlier, we provided the change at the mean of the other explanatory variables, which I think works well in this application because the other variables are all continuous and it is easy to imagine two cases that are the same on these variables but only differ in their sex.

However, we could also do the average discrete change. For every case, we would compute the difference in the predicted probabilty for \(\texttt{male} = 1\) and \(\texttt{male} = 0\), given the values of the other explanatory variables for that case. Then we would average all these differences.

library(tidyverse)
library(haven)
library(MASS)
library(marginaleffects)

# Read and prepare data
df <- read_dta("../dta/wlshealth.dta") %>%
  filter(!is.na(health04)) %>%
  mutate(health04 = as_factor(health04),
         male = 1 - female,
         male = as_factor(male))

# Fit ordered logit model
model <- polr(health04 ~ male + z_ses57 + z_classrank + hn1,
              data = df, Hess = TRUE, method = "logistic")

# Average discrete change across entire sample
avg_marg_effect <- avg_slopes(model, variables = "male")

avg_marg_effect


     Group Estimate Std. Error     z Pr(>|z|)   S    2.5 %    97.5 %
 poor      -0.00241    0.00100 -2.40   0.0163 5.9 -0.00438 -0.000443
 fair      -0.00733    0.00301 -2.43   0.0150 6.1 -0.01324 -0.001427
 good      -0.01516    0.00621 -2.44   0.0146 6.1 -0.02733 -0.002987
 very good  0.00548    0.00224  2.45   0.0144 6.1  0.00109  0.009873
 excellent  0.01942    0.00798  2.43   0.0150 6.1  0.00377  0.035063

Term: male
Type: probs
Comparison: 1 - 0


. margins, dydx(male)

Average marginal effects                                 Number of obs = 7,221
Model VCE: OIM

dy/dx wrt: 1.male

1._predict: Pr(health04==1), predict(pr outcome(1))
2._predict: Pr(health04==2), predict(pr outcome(2))
3._predict: Pr(health04==3), predict(pr outcome(3))
4._predict: Pr(health04==4), predict(pr outcome(4))
5._predict: Pr(health04==5), predict(pr outcome(5))

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
0.male       |  (base outcome)
-------------+----------------------------------------------------------------
1.male       |
    _predict |
          1  |  -.0024112   .0010041    -2.40   0.016    -.0043791   -.0004432
          2  |  -.0073319   .0030129    -2.43   0.015    -.0132371   -.0014268
          3  |  -.0151567    .006209    -2.44   0.015    -.0273261   -.0029873
          4  |   .0054827   .0022401     2.45   0.014     .0010923    .0098731
          5  |   .0194171    .007983     2.43   0.015     .0037708    .0350634
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

The average discrete change here is .019, and similar (but not identical) to the discrete change at the means that we computed before. We could interpret this as:

Net of class rank, test score, and family background, men are on average about two percentage points more likely to report being in excellent health.

“On average” is maybe a little bit of a cheat here for “averaging over the observed values of the explanatory variables,” but seems harmless enough.

Example using a multicategory explanatory variable

In this example, we are using the General Social Survey data in which the outcome is attitudes on national health spending (1 = “too little”, 2 =“about right”, 3 = “too much”). We have several explanatory variables, one of which is \(\texttt{race4cat}\), where the categories are White(\(=1\)), Black(\(=2\)), Latino(\(=3\)) and other(\(=4\)).

library(tidyverse)
library(haven)
library(MASS)
library(marginaleffects)

# Read and prepare data
df <- read_dta("../dta/gss_natheal.dta") %>%
  filter(as.integer(natheal) <= 3) %>%  # keep only too little/about right/too much
  mutate(natheal = factor(as_factor(natheal),
                          levels = c("too little", "about right", "too much")),
         year = as_factor(year),
         party = relevel(as_factor(party), ref = "indep"),
         female = as_factor(female),
         race4cat = relevel(as_factor(race4cat), ref = "black")) %>%
  drop_na(race4cat, party)

# Fit ordered logit model with weights
model <- polr(natheal ~ year + party + female + race4cat + age,
              data = df, Hess = TRUE, method = "logistic",
              weights = df$wtssall)

Warning in eval(family$initialize): non-integer #successes in a binomial glm!

# Average discrete change for race4cat
avg_marg_effect <- avg_slopes(model, variables = "race4cat")

avg_marg_effect


       Group       Contrast Estimate Std. Error     z Pr(>|z|)    S   2.5 %
 too little  latino - black  -0.1431    0.01593 -8.98   <0.001 61.7 -0.1743
 too little  other - black   -0.1508    0.02287 -6.59   <0.001 34.4 -0.1956
 too little  white - black   -0.0644    0.01258 -5.12   <0.001 21.6 -0.0890
 about right latino - black   0.0943    0.01062  8.88   <0.001 60.3  0.0734
 about right other - black    0.0988    0.01435  6.88   <0.001 37.3  0.0707
 about right white - black    0.0444    0.00893  4.97   <0.001 20.5  0.0269
 too much    latino - black   0.0489    0.00566  8.63   <0.001 57.2  0.0378
 too much    other - black    0.0520    0.00883  5.88   <0.001 27.9  0.0346
 too much    white - black    0.0199    0.00371  5.38   <0.001 23.7  0.0127
  97.5 %
 -0.1119
 -0.1059
 -0.0397
  0.1151
  0.1269
  0.0619
  0.0599
  0.0693
  0.0272

Term: race4cat
Type: probs


. use ../dta/gss_natheal, clear
(General Social Survey 2000-2018 extract)

. quietly ologit natheal i.year ib2.party ///
>         i.female ib2.race4cat age [pweight=wtssall], nolog vsquish allbaselevels

. margins, dydx(race4cat)

Average marginal effects                                Number of obs = 11,901
Model VCE: Robust

dy/dx wrt: 1.race4cat 3.race4cat 4.race4cat

1._predict: Pr(natheal==1), predict(pr outcome(1))
2._predict: Pr(natheal==2), predict(pr outcome(2))
3._predict: Pr(natheal==3), predict(pr outcome(3))

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.race4cat   |
    _predict |
          1  |  -.0646305   .0137264    -4.71   0.000    -.0915338   -.0377273
          2  |   .0445747   .0097511     4.57   0.000      .025463    .0636864
          3  |   .0200559   .0040495     4.95   0.000     .0121191    .0279926
-------------+----------------------------------------------------------------
2.race4cat   |  (base outcome)
-------------+----------------------------------------------------------------
3.race4cat   |
    _predict |
          1  |  -.1436256    .018845    -7.62   0.000    -.1805612   -.1066901
          2  |   .0944851   .0124224     7.61   0.000     .0701376    .1188326
          3  |   .0491405   .0068169     7.21   0.000     .0357797    .0625013
-------------+----------------------------------------------------------------
4.race4cat   |
    _predict |
          1  |  -.1512815   .0287431    -5.26   0.000     -.207617    -.094946
          2  |   .0990152    .017778     5.57   0.000     .0641711    .1338594
          3  |   .0522662   .0112981     4.63   0.000     .0301225      .07441
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

The reference category in the example is Black (non-Latino), so the other results are changes in probabilities between other race/ethnic categories and Black.

Here is how I might interpret the highlighted result:

Net of other variables, White GSS respondents are less likely than Black respondents to say the US is spending too little on health. Averaging over all respondents, the difference is about 6.5 percentage points.