Comparing logit estimates across models

Sometimes in regression model we want to explain how much of the relationship between an key explanatory variable and outcome is accounted for by another explanatory variable.

For example, when there is a mediator, we may be interested in how much of the relationship between the explanatory variable and outcome is explained by the mediator. If our outcome is an person’s earnings and the explanatory variable is that person’s parents’ education, we might be interested in how much of the association between parents’ education and earnings is explained by their own education.

In OLS, we would answer this question by fitting two models:

  1. A baseline model that includes the key explanatory variable and other covariates but not the mediator.
  2. An full model that includes all the regressors in the baseline model plus the mediator.

We will refer to the coefficient of our key explanatory variable in the first model as \(\beta_{baseline}\) and in the second model as \(\beta_{full}\).

The percent change attributed to the mediator is just the percent change in the coefficient:

\[ \textrm{proportion accounted for by mediator} = \left(\frac{\beta_{baseline} - \beta_{full}}{\beta_{baseline}}\right) \]

If \(\beta_{baseline}\) was 2 and \(\beta_{full}\) was 1.5, then (2-1.5)/2 = .25, meaning that 25 percent of the relationship between the key explanatory variable and outcome was accounted for by the mediator.

Sometimes in causal analyses the language of total effect and indirect effect is used, where one would say that 25 percent of the total effect of the explanatory variable on the outcome is due to the indirect effect via the mediator.

In OLS, if the mediator is not correlated with the explanatory variable, then the coefficient of the explanatory variable will not change when the mediator is added to the model. In that case, \(\beta_{baseline} = \beta_{full}\), making the result for the above formula 0.

But in logit models, as we have already seen, coefficients do change when variables uncorrelated with the key explanatory variable are added to the model, as long as the added variables are associated with the outcome. Specifically, logit coefficients when uncorrelated regressors that predict the outcome become larger in maginitude.

When we talk about mediation, this rescaling may seem undesirable for accurately characterizing the extent to which the mediator accounts for the explanatory variable and the outcome. We might consider the true amount of mediation instead to reflect the change of the coefficient net of the consequence of the rescaling of the underlying latent variable.

Aside: This is not just about mediation. The basic points here do not apply just to mediation; one might be interested in how much of an association is in fact the result of a confounded. But for simplicity of language I am going to refer to the variable whose impact on the relationship between the key explanatory variable and outcome we are interested in assessing as a mediator.

Solutions

There are several ways that one could eliminate the scaling artifact.

  1. The average marginal effects of a logit model are not affected by the addition of uncorrelated regressors. So instead of looking at the percentage change in the coefficient, one could look at the percentage change in the average marginal effects.
  2. Like OLS, the coefficients of the linear probability model are not affected by the addition of uncorrelated regressors. So one could study mediation using a linear probability model rather than logit/probit.
  3. The Karlsen-Holm-Breen method, often referred to as the KHB method, involves making a residualized version of the mediating variable. The residualized mediator is uncorrelated with the key explanatory variable but retains its independent explanatory power for predicting the outcome. The residualized mediator can be used to determine how much of the change in the logit/probit coefficient of the key explanatory variable is due to the scaling difference, and so then one can “correct” the estimated change so that it is net of the scaling difference.
  4. Logit coefficients can be transformed using y-standardization to coefficients for which the standard deviation of the underlying latent variable is 1. Hence, comparing y-standardized coefficients across models does not have any scaling difference.

Below I am going to work through each of these methods. Here is the big picture of the overall lesson, though.

Big Picture points about these solutions

  1. If the idea is to explain what percentage of the relationship between the explanatory variable and outcome is explained by a third variable (e.g., a mediator), comparing logit coefficients will understate the amount of change relative to these other methods. The problem motivated a little flurry of papers about this issue: comparing logit coefficients underestimates mediation.

  2. Average marginal effects, \(y\)-standardization, and the KHB method, both in the example below and my more general experience, give quite similar results. The authors of the KHB model have done some simulations showing that their method performs better under certain conditions, but I am a bit skeptical that these differences are big enough or broad enough in practice to be worth it.

  3. The linear probability model results are going to be more different than the other methods. If you think doing the logit model vs. the linear probability model is better in the first place, it’s not clear to me why you would change your mind because of this issue given that the average marginal effects are straightforward enough to calculate.

My conclusion: If I had an application, like mediation analysis, where I needed to be able to compare the magnitude of results across models, my go-to approach would be average marginal effects.

Example

Our outcome is subjective social class identification in the 2000-2018 General Social Survey. Our question is the extent to which the relationship between a person’s education and their subjective social class identification is mediated by their household income.

Preparing the data for the example

Two notes on recoding the GSS data:

  • In the GSS, social class identification is measured using four categories: upper, middle, working, and lower. We make a dichotomous variable called \(\texttt{middleupper}\) that \(=1\) if upper or middle and \(=0\) if working or lower.
  • The variable in the GSS is household income in 1986 dollars. Because 2010 is about the midpoint between the range of years we are examining (2000-2018), I converted it to 2010 dollars instead. From online CPI calculators, I converted 1986 dollars to 2010 dollars by multiplying by 1.99; the new variable is called \(\texttt{inc2010}\).
Expand to show code for opening and recoding data
# dependencies
library(tidyverse)
library(haven)
library(marginaleffects)
library(texreg)

options(scipen = 9)

df <- read_dta("../dta/gss_soc383.dta") %>%
  filter(year >= 2000 & year <= 2018) %>%
  mutate(middleupper = case_when(
    class %in% 1:2 ~ 0,
    class %in% 3:4 ~ 1,
    TRUE ~ NA_real_
  )) %>%
  mutate(inc2010 = (realinc * 1.99)/1000) %>% # convert to 2010 dollars
  drop_na(middleupper, educ, realinc)

In the GSS, social class identification is measured using four categories: upper, middle, working, and lower. We are going to make a dichotomous variable called \(\texttt{middleupper}\) that \(=1\) if upper or middle and \(=0\) if working or lower.

. use ../dta/gss_soc383.dta, clear
(GSS 1972-2018: selected variables)

. drop if year < 2000 | year > 2018
(38,116 observations deleted)

. recode class (1/2=0) (3/4=1) (*=.), gen(middleupper)
(26698 differences between class and middleupper)

The GSS measure of real household income (\(\mathtt{realinc}\)) is measured in 1986 dollars. We will convert this to 2010 dollars – according to online sources this requires inflating by a factor of 1.99 – and we will make the metric of our new variable \(\texttt{inc2010}\) in thousands of dollars instead of dollars.

. gen inc2010 = (realinc * 1.99)/1000
(3,025 missing values generated)

. label variable inc2010 "household income in 2010 dollars (/1000)"

We will also drop cases with missing data so that we do not have to worry about the sample size changing if we add a variable with missing data to a model.

. drop if middleupper >= .
(1,692 observations deleted)

. drop if educ >= .
(43 observations deleted)

. drop if realinc >= .
(2,696 observations deleted)

Difference in logit coefficients

We fit two logit models, one without the mediator and one with the mediator.

Expand to show code for fitting models and displaying results
reduced_model <- glm(
  middleupper ~ educ,
  data = df,
  family = "binomial",
  weights = wtssall
)

full_model <- glm(
  middleupper ~ educ + inc2010,
  data = df,
  family = "binomial",
  weights = wtssall
)

# display results
screenreg(
  list(reduced_model, full_model),
  custom.model.names = c("Without income", "With income"),
  custom.coef.map = list(
    "educ" = "Years of education",
    "inc2010" = "Household income (2010 $)",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3, 
  include.rsquared = TRUE, 
  include.adjrs = FALSE, 
)

=========================================================
                           Without income  With income   
---------------------------------------------------------
Years of education              0.234 ***       0.153 ***
                               (0.005)         (0.006)   
Household income (2010 $)                       0.011 ***
                                               (0.000)   
Intercept                      -3.300 ***      -2.988 ***
                               (0.077)         (0.078)   
---------------------------------------------------------
AIC                         26870.454       24971.215    
BIC                         26886.476       24995.247    
Log Likelihood             -13433.227      -12482.607    
Deviance                    28230.890       26247.266    
Num. obs.                   22267           22267        
=========================================================
*** p < 0.01; ** p < 0.05; * p < 0.1
. logit middleupper educ [pweight=wtssall], nolog

Logistic regression                                    Number of obs =  22,267
                                                       Wald chi2(1)  = 1191.28
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -14115.445                      Pseudo R2     =  0.0711

------------------------------------------------------------------------------
             |               Robust
 middleupper | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .2335749   .0067673    34.51   0.000     .2203111    .2468386
       _cons |  -3.299956   .0954463   -34.57   0.000    -3.487028   -3.112885
------------------------------------------------------------------------------

. logit middleupper educ inc2010 [pweight=wtssall], nolog

Logistic regression                                    Number of obs =  22,267
                                                       Wald chi2(2)  = 1716.08
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -13123.633                      Pseudo R2     =  0.1363

------------------------------------------------------------------------------
             |               Robust
 middleupper | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1531725   .0071114    21.54   0.000     .1392345    .1671106
     inc2010 |   .0111996   .0004168    26.87   0.000     .0103826    .0120165
       _cons |  -2.987896   .0950885   -31.42   0.000    -3.174266   -2.801526
------------------------------------------------------------------------------

The logit coefficient decreases from .2336 to .1522, a difference of 34.4% \(\left(\frac{.2336 - .1532}{.2336}\right)\).

Again, if this was OLS we might then offer the interpretation that household income accounts for 34.4% of the association and class identification. But someone might object to this statement given that logit coefficients increase when uncorrelated variables are added to the model.

Methods for estimating % mediation

Average marginal change

While the logit coefficients change when uncorrelated regressors are added to the model, the average marginal change does not. Accordingly, if we measure the % difference between models in terms of the average marginal change, the estimate will not be affected by the inclusion of uncorrelated regressors.

Expand to show code for fitting models and displaying results
reduced_ame <- avg_comparisons(
  reduced_model,
  variables = "educ",
  comparison = "dydx", 
  vcov = FALSE,
  wts = TRUE # use weights when computing average - default is FALSE!
)
reduced_ame

 Estimate
   0.0528

Term: educ
Type: response
Comparison: dY/dX
full_ame <- avg_comparisons(
  full_model,
  variables = "educ",
  comparison = "dydx", 
  vcov = FALSE,
  wts = TRUE
)
full_ame

 Estimate
   0.0316

Term: educ
Type: response
Comparison: dY/dX

The difference is .0528 vs .0318, or a decrease of (.0528 - .0318) / .0528 = 39.7%. We could interpret this as:

  • Household income accounts for 39.7% of the relationship between educational attainment and whether someone identifies as upper or middle class (vs. working or lower class).

After fitting each logit model, we calculate average marginal effects for educ by using \(\texttt{margins}\) with the \(\texttt{dydx()}\) option.

. quietly logit middleupper educ [pweight=wtssall]

. margins, dydx(educ)

Average marginal effects                                Number of obs = 22,267
Model VCE: Robust

Expression: Pr(middleupper), predict()
dy/dx wrt:  educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0527973   .0012731    41.47   0.000     .0503021    .0552925
------------------------------------------------------------------------------

. quietly logit middleupper educ inc2010 [pweight=wtssall]

. margins, dydx(educ)

Average marginal effects                                Number of obs = 22,267
Model VCE: Robust

Expression: Pr(middleupper), predict()
dy/dx wrt:  educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0316367   .0013925    22.72   0.000     .0289074     .034366
------------------------------------------------------------------------------

The difference is .05280 vs .0316, or a decrease of 40.1%. We could interpret this as:

  • Household income accounts for 40.1% of the relationship between educational attainment and whether someone identifies as upper or middle class (vs. working or lower class).

Change in linear probability model coefficients

Linear probability model coefficients are often similar to the average marginal change. So, someone interested in comparing across models might decide to eschew fitting a logit model in favor of a linear probability model.

Expand to show code for fitting models and displaying results
reduced_model <- lm(
  middleupper ~ educ,
  data = df,
  weights = wtssall
)

full_model <- lm(
  middleupper ~ educ + inc2010,
  data = df,
  weights = wtssall
)

# display results
screenreg(
  list(reduced_model, full_model),
  custom.model.names = c("Without income", "With income"),
  custom.coef.map = list(
    "educ" = "Years of education",
    "inc2010" = "Household income (2010 $)",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 4, 
  include.rsquared = TRUE, 
  include.adjrs = FALSE, 
)

=========================================================
                           Without income  With income   
---------------------------------------------------------
Years of education             0.0507 ***      0.0322 ***
                              (0.0011)        (0.0011)   
Household income (2010 $)                      0.0021 ***
                                              (0.0000)   
Intercept                     -0.2168 ***     -0.1202 ***
                              (0.0149)        (0.0144)   
---------------------------------------------------------
R^2                            0.0916          0.1676    
Num. obs.                  22267           22267         
=========================================================
*** p < 0.01; ** p < 0.05; * p < 0.1

The difference in coefficients here is .0507 vs. .0322, or a difference of 36.4% We could offer the same interpretation as above, only using 36.4 instead of the earlier result.

. regress middleupper educ [pweight=wtssall]
(sum of wgt is 21,967.0887912027)

Linear regression                               Number of obs     =     22,267
                                                F(1, 22265)       =    1792.78
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0916
                                                Root MSE          =     .47589

------------------------------------------------------------------------------
             |               Robust
 middleupper | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0507396   .0011984    42.34   0.000     .0483908    .0530885
       _cons |  -.2167826    .016873   -12.85   0.000    -.2498549   -.1837104
------------------------------------------------------------------------------

. regress middleupper educ inc2010 [pweight=wtssall]
(sum of wgt is 21,967.0887912027)

Linear regression                               Number of obs     =     22,267
                                                F(2, 22264)       =    2293.40
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1676
                                                Root MSE          =     .45556

------------------------------------------------------------------------------
             |               Robust
 middleupper | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0322284   .0012773    25.23   0.000     .0297248     .034732
     inc2010 |   .0020581   .0000494    41.70   0.000     .0019614    .0021548
       _cons |   -.120162   .0164134    -7.32   0.000    -.1523335   -.0879905
------------------------------------------------------------------------------

The difference in coefficients here is .05074 vs. .03223, or a difference of 36.5% We could offer the same interpretation as above, only using this instead of the 40.1% that was above.

KHB method

The KHB method (for Karlson, Holm, and Breen) involves first generating a measure of income (the mediator) that is residualized on education (the key explanatory variable). That is, we use OLS regression to predict income from education. Residualized income is then the difference between observed income and the predicted income from this regression.

# ols predicting income from education
ols <- lm(inc2010 ~ educ, data = df, weights = wtssall)

# create new variable that is residualized income
df <- df %>%
  mutate(inc2010_resid = predict(ols) - inc2010)

While income and education are correlated, residualized income and education are not.

library(weights) # for obtaining weighted correlations

# show that inc2010 and educ are correlated
wtd.cor(x = df$inc2010, y = df$educ, weight = df$wtssall)
  correlation     std.err  t.value p.value
Y   0.3718302 0.006221246 59.76781       0
# show that inc2010 and educ are not correlated
wtd.cor(x = df$inc2010_resid, y = df$educ, weight = df$wtssall)
  correlation     std.err               t.value p.value
Y 2.86596e-15 0.006701757 0.0000000000004276431       1

If we include residualized income in a logit model with education, the coefficient of education will be larger than the logit coefficient because we are adding uncorrelated information that predicts the outcome.

Expand to show code for fitting models and displaying results
reduced_model <- glm(
  middleupper ~ educ,
  data = df,
  family = "binomial",
  weights = wtssall
)

 resid_model <- glm(
  middleupper ~ educ + inc2010_resid,
  data = df,
  family = "binomial",
  weights = wtssall
)

# display results
screenreg(
  list(reduced_model, resid_model),
  custom.model.names = c("Bivariate", "With resid. income"),
  custom.coef.map = list(
    "educ" = "Years of education",
    "inc2010_resid" = "Residualized income",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3, 
  include.aic = FALSE, 
  include.bic = FALSE, 
  include.deviance = FALSE
)

=======================================================
                     Bivariate       With resid. income
-------------------------------------------------------
Years of education        0.234 ***       0.254 ***    
                         (0.005)         (0.006)       
Residualized income                      -0.011 ***    
                                         (0.000)       
Intercept                -3.300 ***      -3.514 ***    
                         (0.077)         (0.079)       
-------------------------------------------------------
Log Likelihood       -13433.227      -12482.607        
Num. obs.             22267           22267            
=======================================================
*** p < 0.01; ** p < 0.05; * p < 0.1

We can think of the model with residualized income as indicating what the logit coefficient would have been if income was uncorrelated with education but still predicted the outcome. In effect, we will use the coefficient of the .254 from the second model in the table instead of the .234 in order to assess the decrease in the coefficient when (unresidualized) income is added to the logit model.

. regress inc2010 educ [pweight=wtssall]
(sum of wgt is 21,967.0887912027)

Linear regression                               Number of obs     =     22,267
                                                F(1, 22265)       =    2283.74
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1383
                                                Root MSE          =     66.884

------------------------------------------------------------------------------
             |               Robust
     inc2010 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |    8.99433   .1882111    47.79   0.000     8.625423    9.363237
       _cons |  -46.94649   2.415632   -19.43   0.000     -51.6813   -42.21168
------------------------------------------------------------------------------

. predict yhat
(option xb assumed; fitted values)

. gen inc2010_resid = yhat - inc2010

As the remaining step, we will compare logit estimates in which residualized vs. unresidualized income are included along with education in the model.

Expand to show code for fitting models and displaying results
full_model <- glm(
  middleupper ~ educ + inc2010,
  data = df,
  family = "binomial",
  weights = wtssall
)

# display results
screenreg(
  list(resid_model, full_model),
  custom.model.names = c("Resid. income", "Unresid. income"),
  custom.coef.map = list(
    "educ" = "Years of education",
    "inc2010_resid" = "Residualized income",
    "inc2010" = "(Unresidualized) income",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3, 
  include.aic = FALSE, 
  include.bic = FALSE, 
  include.deviance = FALSE
)

========================================================
                         Resid. income   Unresid. income
--------------------------------------------------------
Years of education            0.254 ***       0.153 *** 
                             (0.006)         (0.006)    
Residualized income          -0.011 ***                 
                             (0.000)                    
(Unresidualized) income                       0.011 *** 
                                             (0.000)    
Intercept                    -3.514 ***      -2.988 *** 
                             (0.079)         (0.078)    
--------------------------------------------------------
Log Likelihood           -12482.607      -12482.607     
Num. obs.                 22267           22267         
========================================================
*** p < 0.01; ** p < 0.05; * p < 0.1

The decrease is (.254 - .153) / .254, or 39.8%

Then we compare the coefficient for education for a model that includes residualized income and the coefficient for eduacation for a model that includes the actual income measure.

. logit middleupper educ inc2010_resid [pweight=wtssall], nolog

Logistic regression                                    Number of obs =  22,267
                                                       Wald chi2(2)  = 1716.08
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -13123.633                      Pseudo R2     =  0.1363

-------------------------------------------------------------------------------
              |               Robust
  middleupper | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
         educ |   .2539052    .006924    36.67   0.000     .2403344    .2674759
inc2010_resid |  -.0111996   .0004168   -26.87   0.000    -.0120165   -.0103826
        _cons |  -3.513677   .0957847   -36.68   0.000    -3.701411   -3.325942
-------------------------------------------------------------------------------

. logit middleupper educ inc2010 [pweight=wtssall], nolog

Logistic regression                                    Number of obs =  22,267
                                                       Wald chi2(2)  = 1716.08
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -13123.633                      Pseudo R2     =  0.1363

------------------------------------------------------------------------------
             |               Robust
 middleupper | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1531725   .0071114    21.54   0.000     .1392345    .1671106
     inc2010 |   .0111996   .0004168    26.87   0.000     .0103826    .0120165
       _cons |  -2.987896   .0950885   -31.42   0.000    -3.174266   -2.801526
------------------------------------------------------------------------------

The difference in coefficients is .2539 vs. .1532, or 39.7%.

Change in y-standardized coefficients

Note: Not yet implemented in R.

We can compute the y-standardized coefficients in Stata using the add-on command \(\mathtt{listcoef}\). We run into a slight problem, however, as listcoef only reports results to 3 decimal places. However, results from \(\mathtt{listcoef}\)’s calculations are stored after the command is run in a matrix called \(\mathtt{r(table)}\), and so we can use that to get the four significant digits we have been using in our earlier calculations.

. quietly logit middleupper educ [pweight=wtssall]

. listcoef, std
note: pweights not compatible with summarize; weights treated as aweights.

logit (N=22267): Unstandardized and standardized estimates 

  Observed SD:  0.4993
    Latent SD:  1.9426

-------------------------------------------------------------------------------
             |         b        z    P>|z|    bStdX    bStdY   bStdXY     SDofX
-------------+-----------------------------------------------------------------
        educ |    0.2336   34.515    0.000    0.696    0.120    0.358     2.979
    constant |   -3.3000  -34.574    0.000        .        .        .         .
-------------------------------------------------------------------------------

. mat list r(table)

r(table)[2,7]
                   b           z       P>|z|       bStdX       bStdY      bStdXY       SDofX
    educ   .23357486   34.514975   4.78e-261   .69571051    .1202353   .35812484   2.9785334
constant  -3.2999562  -34.573959   6.22e-262           .           .           .           .

. quietly logit middleupper educ inc2010 [pweight=wtssall]

. listcoef, std
note: pweights not compatible with summarize; weights treated as aweights.

logit (N=22267): Unstandardized and standardized estimates 

  Observed SD:  0.4993
    Latent SD:  2.1031

-------------------------------------------------------------------------------
             |         b        z    P>|z|    bStdX    bStdY   bStdXY     SDofX
-------------+-----------------------------------------------------------------
        educ |    0.1532   21.539    0.000    0.456    0.073    0.217     2.979
     inc2010 |    0.0112   26.868    0.000    0.807    0.005    0.384    72.049
    constant |   -2.9879  -31.422    0.000        .        .        .         .
-------------------------------------------------------------------------------

. mat list r(table)

r(table)[3,7]
                   b           z       P>|z|       bStdX       bStdY      bStdXY       SDofX
    educ   .15317255   21.539081   6.70e-103   .45622954    .0728329   .21693522   2.9785334
 inc2010   .01119957   26.868427   5.14e-159   .80691507   .00532535    .3836847    72.04877
constant  -2.9878963  -31.422282   1.00e-216           .           .           .           .

The \(\texttt{listcoef}\) output would lead us to calculate a change of .120 vs. .078, which is 35.0%. But, using the expanded precision from looking at more decimal places we get a difference of .12024 vs. .07283, or 39.4%.